PREFACE


This book is intended as a text covering the central concepts of practical optimization 
techniques. It is designed for either self-study by professionals or classroom work at 
the undergraduate or graduate level for students who have a technical background 
in engineering, mathematics, or science. Like the field of optimization itself, 
which involves many classical disciplines, the book should be useful to system 
analysts, operations researchers, numerical analysts, management scientists, and 
other specialists from the host of disciplines from which practical optimization appli-
cations are drawn. The prerequisites for convenient use of the book are relatively
modest, the prime requirement being some familiarity with introductory elements
of linear algebra. Certain sections and developments do assume some knowledge 
of more advanced concepts of linear algebra, such as eigenvector analysis, or some 
background in sets of real numbers, but the text is structured so that the mainstream 
of the development can be faithfully pursued without reliance on this more advanced 
background material. 

Although the book covers primarily material that is now fairly standard, it 
is intended to reflect modern theoretical insights. These provide structure to what 
might otherwise be simply a collection of techniques and results, and this is valuable 
both as a means for learning existing material and for developing new results. One 
major insight of this type is the connection between the purely analytical character 
of an optimization problem, expressed perhaps by properties of the necessary condi- 
tions, and the behavior of algorithms used to solve a problem. This was a major 
theme of the first edition of this book and the second edition expands and further 
illustrates this relationship. 

As in the second edition, the material in this book is organized into three 
separate parts. Part I is a self-contained introduction to linear programming, a key 
component of optimization theory. The presentation in this part is fairly conven- 
tional, covering the main elements of the underlying theory of linear programming, 
many of the most effective numerical algorithms, and many of its important special 
applications. Part II, which is independent of Part I, covers the theory of uncon- 
strained optimization, including both derivations of the appropriate optimality condi- 
tions and an introduction to basic algorithms. This part of the book explores the 
general properties of algorithms and defines various notions of convergence. Part III 
extends the concepts developed in the second part to constrained optimization 
problems. Except for a few isolated sections, this part is also independent of Part I.
It is possible to go directly into Parts II and III omitting Part I, and, in fact, the 
book has been used in this way in many universities. Each part of the book contains 
enough material to form the basis of a one-quarter course. In either classroom use 
or for self-study, it is important not to overlook the suggested exercises at the end of 
each chapter. The selections generally include exercises of a computational variety 
designed to test one’s understanding of a particular algorithm, a theoretical variety 
designed to test one’s understanding of a given theoretical development, or of the 
variety that extends the presentation of the chapter to new applications or theoretical 
areas. One should attempt at least four or five exercises from each chapter. In 
progressing through the book it would be unusual to read straight through from 
cover to cover. Generally, one will wish to skip around. In order to facilitate this 
mode, we have indicated sections of a specialized or digressive nature with an 
asterisk*. 

There are several features of the revision represented by this third edition. In 
Part I a new Chapter 5 is devoted to a presentation of the theory and methods 
of polynomial-time algorithms for linear programming. These methods include, 
especially, interior point methods that have revolutionized linear programming. The 
first part of the book can itself serve as a modern basic text for linear programming. 
Part II includes an expanded treatment of necessary conditions, manifested by 
not only first- and second-order necessary conditions for optimality, but also by 
zeroth-order conditions that use no derivative information. This part continues to 
present the important descent methods for unconstrained problems, but there is new 
material on convergence analysis and on Newton’s method, which is frequently
used as the workhorse of interior point methods for both linear and nonlinear 
programming. Finally, Part III now includes the global theory of necessary condi-
tions for constrained problems, expressed as zeroth-order conditions. Also, interior
point methods for general nonlinear programming are explicitly discussed within 
the sections on penalty and barrier methods. A significant addition to Part III is 
an expanded presentation of duality from both the global and local perspective. 
Finally, Chapter 15, on primal-dual methods, has additional material on interior
point methods and an introduction to the relatively new field of semidefinite 
programming, including several examples. 

We wish to thank the many students and researchers who over the years have 
given us comments concerning the second edition and those who encouraged us to 
carry out this revision. 


Stanford, California D.G.L.
July 2007 Y.Y.
Chapter 1 INTRODUCTION


1.1. OPTIMIZATION 


The concept of optimization is now well rooted as a principle underlying the analysis 
of many complex decision or allocation problems. It offers a certain degree of 
philosophical elegance that is hard to dispute, and it often offers an indispensable 
degree of operational simplicity. Using this optimization philosophy, one approaches 
a complex decision problem, involving the selection of values for a number of 
interrelated variables, by focussing attention on a single objective designed to 
quantify performance and measure the quality of the decision. This one objective is 
maximized (or minimized, depending on the formulation) subject to the constraints 
that may limit the selection of decision variable values. If a suitable single aspect 
of a problem can be isolated and characterized by an objective, be it profit or loss 
in a business setting, speed or distance in a physical problem, expected return in the 
environment of risky investments, or social welfare in the context of government 
planning, optimization may provide a suitable framework for analysis. 

It is, of course, a rare situation in which it is possible to fully represent all the 
complexities of variable interactions, constraints, and appropriate objectives when 
faced with a complex decision problem. Thus, as with all quantitative techniques 
of analysis, a particular optimization formulation should be regarded only as an 
approximation. Skill in modelling, to capture the essential elements of a problem, 
and good judgment in the interpretation of results are required to obtain meaningful 
conclusions. Optimization, then, should be regarded as a tool of conceptualization 
and analysis rather than as a principle yielding the philosophically correct solution. 

Skill and good judgment, with respect to problem formulation and interpretation
of results, are enhanced through concrete practical experience and a thorough under-
standing of relevant theory. Problem formulation itself always involves a tradeoff
between the conflicting objectives of building a mathematical model sufficiently 
complex to accurately capture the problem description and building a model that is 
tractable. The expert model builder is facile with both aspects of this tradeoff. One 
aspiring to become such an expert must learn to identify and capture the important 
issues of a problem mainly through example and experience; one must learn to 
distinguish tractable models from nontractable ones through a study of available 
technique and theory and by nurturing the capability to extend existing theory to 
new situations. 


This book is centered around a certain optimization structure—that character- 
istic of linear and nonlinear programming. Examples of situations leading to this 
structure are sprinkled throughout the book, and these examples should help to 
indicate how practical problems can often be fruitfully structured in this form. The
book mainly, however, is concerned with the development, analysis, and comparison 
of algorithms for solving general subclasses of optimization problems. This is 
valuable not only for the algorithms themselves, which enable one to solve given 
problems, but also because identification of the collection of structures they most 
effectively solve can enhance one’s ability to formulate problems. 


1.2 TYPES OF PROBLEMS 


The content of this book is divided into three major parts: Linear Programming, 
Unconstrained Problems, and Constrained Problems. The last two parts together 
comprise the subject of nonlinear programming. 


Linear Programming 


Linear programming is without doubt the most natural mechanism for formulating a 
vast array of problems with modest effort. A linear programming problem is charac- 
terized, as the name implies, by linear functions of the unknowns; the objective is 
linear in the unknowns, and the constraints are linear equalities or linear inequal- 
ities in the unknowns. One familiar with other branches of linear mathematics might 
suspect, initially, that linear programming formulations are popular because the 
mathematics is nicer, the theory is richer, and the computation simpler for linear 
problems than for nonlinear ones. But, in fact, these are not the primary reasons. 
In terms of mathematical and computational properties, there are much broader 
classes of optimization problems than linear programming problems that have elegant 
and potent theories and for which effective algorithms are available. It seems that 
the popularity of linear programming lies primarily with the formulation phase of 
analysis rather than the solution phase—and for good cause. For one thing, a great 
number of constraints and objectives that arise in practice are indisputably linear. 
Thus, for example, if one formulates a problem with a budget constraint restricting 
the total amount of money to be allocated among two different commodities, the 
budget constraint takes the form $x_1 + x_2 \leq B$, where $x_i$, $i = 1, 2$, is the amount
allocated to activity $i$, and $B$ is the budget. Similarly, if the objective is, for example,
maximum weight, then it can be expressed as $w_1 x_1 + w_2 x_2$, where $w_i$, $i = 1, 2$,
is the unit weight of the commodity $i$. The overall problem would be expressed as

\[
\begin{array}{ll}
\text{maximize} & w_1 x_1 + w_2 x_2 \\
\text{subject to} & x_1 + x_2 \leq B \\
& x_1 \geq 0, \; x_2 \geq 0,
\end{array}
\]

which is an elementary linear program. The linearity of the budget constraint is
extremely natural in this case and does not represent simply an approximation to a 
more general functional form. 
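
The elementary program above is small enough to solve by inspecting the corner points of its feasible region. The following Python sketch does exactly that. It relies on the standard fact that a linear program with a bounded, nonempty feasible region attains its optimum at a corner point; the function name `solve_budget_lp` and the numerical data are illustrative assumptions, not part of the text.

```python
def solve_budget_lp(w1, w2, budget):
    """Maximize w1*x1 + w2*x2 subject to x1 + x2 <= budget, x1 >= 0, x2 >= 0.

    Assumes budget >= 0, so the feasible region is a triangle whose
    corner points are (0, 0), (budget, 0) and (0, budget).
    """
    vertices = [(0.0, 0.0), (budget, 0.0), (0.0, budget)]
    # Evaluate the linear objective at each corner and keep the best one.
    best = max(vertices, key=lambda v: w1 * v[0] + w2 * v[1])
    return best, w1 * best[0] + w2 * best[1]


# Hypothetical data: unit weights 3 and 5, budget 10.
allocation, value = solve_budget_lp(w1=3.0, w2=5.0, budget=10.0)
print(allocation, value)  # (0.0, 10.0) 50.0
```

With these data the whole budget goes to the commodity with the larger unit weight, as one would expect from the linearity of both objective and constraint.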

Another reason that linear forms for constraints and objectives are so popular 
in problem formulation is that they are often the least difficult to define. Thus, even 
if an objective function is not purely linear by virtue of its inherent definition (as in 
the above example), it is often far easier to define it as being linear than to decide 
on some other functional form and convince others that the more complex form is 
the best possible choice. Linearity, therefore, by virtue of its simplicity, often is 
selected as the easy way out or, when seeking generality, as the only functional form 
that will be equally applicable (or nonapplicable) in a class of similar problems. 

Of course, the theoretical and computational aspects do take on a somewhat 
special character for linear programming problems—the most significant devel- 
opment being the simplex method. This algorithm is developed in Chapters 2 
and 3. More recent interior point methods are nonlinear in character and these are 
developed in Chapter 5. 


Unconstrained Problems 


It may seem that unconstrained optimization problems are so devoid of struc- 
tural properties as to preclude their applicability as useful models of meaningful 
problems. Quite the contrary is true for two reasons. First, it can be argued, quite 
convincingly, that if the scope of a problem is broadened to the consideration of 
all relevant decision variables, there may then be no constraints—or put another 
way, constraints represent artificial delimitations of scope, and when the scope 
is broadened the constraints vanish. Thus, for example, it may be argued that a 
budget constraint is not characteristic of a meaningful problem formulation, since by
borrowing at some interest rate it is always possible to obtain additional funds, and 
hence rather than introducing a budget constraint, a term reflecting the cost of funds 
should be incorporated into the objective. A similar argument applies to constraints 
describing the availability of other resources which at some cost (however great) 
could be supplemented. 

The second reason that many important problems can be regarded as having no 
constraints is that constrained problems are sometimes easily converted to uncon- 
strained problems. For instance, the sole effect of equality constraints is simply to 
limit the degrees of freedom, by essentially making some variables functions of 
others. These dependencies can sometimes be explicitly characterized, and a new 
problem having its number of variables equal to the true degree of freedom can be 
determined. As a simple specific example, a constraint of the form $x_1 + x_2 = B$ can
be eliminated by substituting $x_2 = B - x_1$ everywhere else that $x_2$ appears in the
problem.
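
A minimal sketch of this elimination, in Python, may make the idea concrete. The objective `f`, the value of `B`, the search interval, and the use of golden-section search for the resulting one-variable problem are all assumptions made for illustration; none of them comes from the text.

```python
def f(x1, x2):
    # Hypothetical objective, chosen only for illustration.
    return x1 ** 2 + 2.0 * x2 ** 2


def minimize_1d(g, lo, hi, tol=1e-8):
    """Golden-section search for the minimizer of a unimodal g on [lo, hi]."""
    ratio = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if g(c) < g(d):
            b = d  # the minimizer lies in [a, d]
        else:
            a = c  # the minimizer lies in [c, b]
    return (a + b) / 2


B = 3.0


def reduced(x1):
    # The constraint x1 + x2 = B is eliminated by substituting x2 = B - x1,
    # leaving an unconstrained problem in the single variable x1.
    return f(x1, B - x1)


x1_opt = minimize_1d(reduced, -10.0, 10.0)
x2_opt = B - x1_opt
print(round(x1_opt, 4), round(x2_opt, 4))
```

For this particular objective the reduced function is $x_1^2 + 2(3 - x_1)^2$, whose derivative $6x_1 - 12$ vanishes at $x_1 = 2$, giving $x_2 = 1$; the numerical search agrees with this to within its tolerance.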

Aside from representing a significant class of practical problems, the study 
of unconstrained problems, of course, provides a stepping stone toward the more 
general case of constrained problems. Many aspects of both theory and algorithms 
are most naturally motivated and verified for the unconstrained case before
progressing to the constrained case. 


Constrained Problems 


In spite of the arguments given above, many problems met in practice are formulated 
as constrained problems. This is because in most instances a complex problem such 
as, for example, the detailed production policy of a giant corporation, the planning 
of a large government agency, or even the design of a complex device cannot be 
directly treated in its entirety accounting for all possible choices, but instead must be 
decomposed into separate subproblems—each subproblem having constraints that 
are imposed to restrict its scope. Thus, in a planning problem, budget constraints are 
commonly imposed in order to decouple that one problem from a more global one. 
Therefore, one frequently encounters general nonlinear constrained mathematical 
programming problems. 
The general mathematical programming problem can be stated as

\[
\begin{array}{lll}
\text{minimize} & f(\mathbf{x}) & \\
\text{subject to} & h_i(\mathbf{x}) = 0, & i = 1, 2, \ldots, m \\
& g_j(\mathbf{x}) \leq 0, & j = 1, 2, \ldots, r \\
& \mathbf{x} \in S. &
\end{array}
\]

In this formulation, $\mathbf{x}$ is an n-dimensional vector of unknowns, $\mathbf{x} = (x_1, x_2, \ldots, x_n)$,
and $f$, $h_i$, $i = 1, 2, \ldots, m$, and $g_j$, $j = 1, 2, \ldots, r$, are real-valued functions of the
variables $x_1, x_2, \ldots, x_n$. The set $S$ is a subset of n-dimensional space. The function
$f$ is the objective function of the problem and the equations, inequalities, and set
restrictions are constraints.

Generally, in this book, additional assumptions are introduced in order to 
make the problem smooth in some suitable sense. For example, the functions in 
the problem are usually required to be continuous, or perhaps to have continuous 
derivatives. This ensures that small changes in x lead to small changes in other 
values associated with the problem. Also, the set S is not allowed to be arbitrary
but usually is required to be a connected region of n-dimensional space, rather than, 
for example, a set of distinct isolated points. This ensures that small changes in x 
can be made. Indeed, in a majority of problems treated, the set S is taken to be the 
entire space; there is no set restriction. 

In view of these smoothness assumptions, one might characterize the problems 
treated in this book as continuous variable programming, since we generally discuss 
problems where all variables and function values can be varied continuously. 
In fact, this assumption forms the basis of many of the algorithms discussed, 
which operate essentially by making a series of small movements in the unknown 
x vector. 
1.3 SIZE OF PROBLEMS


One obvious measure of the complexity of a programming problem is its size, 
measured in terms of the number of unknown variables or the number of constraints. 
As might be expected, the size of problems that can be effectively solved has been 
increasing with advancing computing technology and with advancing theory. Today, 
with present computing capabilities, however, it is reasonable to distinguish three 
classes of problems: small-scale problems having about five or fewer unknowns 
and constraints; intermediate-scale problems having from about five to a hundred 
or a thousand variables; and large-scale problems having perhaps thousands or even 
millions of variables and constraints. This classification is not entirely rigid, but 
it reflects at least roughly not only size but the basic differences in approach that 
accompany different size problems. As a rough rule, small-scale problems can be 
solved by hand or by a small computer. Intermediate-scale problems can be solved 
on a personal computer with general purpose mathematical programming codes. 
Large-scale problems require sophisticated codes that exploit special structure and 
usually require large computers. 

Much of the basic theory associated with optimization, particularly in nonlinear 
programming, is directed at obtaining necessary and sufficient conditions satisfied 
by a solution point, rather than at questions of computation. This theory involves 
mainly the study of Lagrange multipliers, including the Karush-Kuhn-Tucker 
Theorem and its extensions. It tremendously enhances insight into the philosophy 
of constrained optimization and provides satisfactory basic foundations for other 
important disciplines, such as the theory of the firm, consumer economics, and 
optimal control theory. The interpretation of Lagrange multipliers that accom- 
panies this theory is valuable in virtually every optimization setting. As a basis for 
computing numerical solutions to optimization, however, this theory is far from 
adequate, since it does not consider the difficulties associated with solving the 
equations resulting from the necessary conditions. 

If it is acknowledged from the outset that a given problem is too large and 
too complex to be efficiently solved by hand (and hence it is acknowledged that 
a computer solution is desirable), then one’s theory should be directed toward 
development of procedures that exploit the efficiencies of computers. In most cases 
this leads to the abandonment of the idea of solving the set of necessary conditions 
in favor of the more direct procedure of searching through the space (in an intelligent 
manner) for ever-improving points. 

Today, search techniques can be effectively applied to more or less general 
nonlinear programming problems. Problems of great size, large-scale programming 
problems, can be solved if they possess special structural characteristics, especially 
sparsity, that can be exploited by a solution method. Today linear programming
software packages are capable of automatically identifying sparse structure within
the input data and taking advantage of this sparsity in numerical computation. It
is now not uncommon to solve linear programs of up to a million variables and 
constraints, as long as the structure is sparse. Problem-dependent methods, where 
the structure is not automatically identified, are largely directed to transportation 
and network flow problems as discussed in Chapter 6. 
This book focuses on the aspects of general theory that are most fruitful
for computation in the widest class of problems. While necessary and sufficient 
conditions are examined and their application to small-scale problems is illustrated, 
our primary interest in such conditions is in their role as the core of a broader 
theory applicable to the solution of larger problems. At the other extreme, although 
some instances of structure exploitation are discussed, we focus primarily on the 
general continuous variable programming problem rather than on special techniques 
for special structures. 


1.4 ITERATIVE ALGORITHMS 
AND CONVERGENCE 


The most important characteristic of a high-speed computer is its ability to perform 
repetitive operations efficiently, and in order to exploit this basic characteristic, most 
algorithms designed to solve large optimization problems are iterative in nature. 
Typically, in seeking a vector that solves the programming problem, an initial vector $\mathbf{x}_0$
is selected and the algorithm generates an improved vector $\mathbf{x}_1$. The process is repeated
and a still better solution $\mathbf{x}_2$ is found. Continuing in this fashion, a sequence of ever-
improving points $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_k, \ldots$ is found that approaches a solution point $\mathbf{x}^*$. For
linear programming problems solved by the simplex method, the generated sequence 
is of finite length, reaching the solution point exactly after a finite (although initially 
unspecified) number of steps. For nonlinear programming problems or interior-point 
methods, the sequence generally does not ever exactly reach the solution point, but 
converges toward it. In operation, the process is terminated when a point sufficiently 
close to the solution point, for practical purposes, is obtained. 
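
The following Python sketch illustrates the general pattern of such an iterative process. The particular method (gradient descent with a fixed step size), the test function, and the stopping rule are assumptions chosen for illustration; in particular, since the solution point is not known in advance, the sketch stops when the gradient is small, which serves here as a practical stand-in for closeness to the solution.

```python
def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Generate iterates x_0, x_1, ..., stopping when the gradient norm < tol."""
    x = list(x0)
    iterates = [tuple(x)]
    for _ in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break  # sufficiently close for practical purposes
        x = [xi - step * gi for xi, gi in zip(x, g)]
        iterates.append(tuple(x))
    return iterates


def objective(x):
    # Hypothetical test function with its minimum at (1, -2).
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2


def gradient(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]


points = gradient_descent(gradient, x0=(0.0, 0.0))
print(len(points), points[-1])
```

Each iterate improves on the previous one, and the sequence approaches the minimizer without ever reaching it exactly; the process simply terminates once the tolerance is met.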

The theory of iterative algorithms can be divided into three (somewhat 
overlapping) aspects. The first is concerned with the creation of the algorithms 
themselves. Algorithms are not conceived arbitrarily, but are based on a creative 
examination of the programming problem, its inherent structure, and the efficiencies 
of digital computers. The second aspect is the verification that a given algorithm 
will in fact generate a sequence that converges to a solution point. This aspect is 
referred to as global convergence analysis, since it addresses the important question 
of whether the algorithm, when initiated far from the solution point, will eventually 
converge to it. The third aspect is referred to as local convergence analysis or 
complexity analysis and is concerned with the rate at which the generated sequence 
of points converges to the solution. One cannot regard a problem as solved simply 
because an algorithm is known which will converge to the solution, since it may 
require an exorbitant amount of time to reduce the error to an acceptable tolerance. 
It is essential when prescribing algorithms that some estimate of the time required 
be available. It is the convergence-rate aspect of the theory that allows some 
quantitative evaluation and comparison of different algorithms, and at least crudely, 
assigns a measure of tractability to a problem, as discussed in Section 1.1. 

A modern-day technical version of Confucius’ most famous saying, and one 
which represents an underlying philosophy of this book, might be, “One good 
theory is worth a thousand computer runs.” Thus, the convergence properties of an 
iterative algorithm can be estimated with confidence either by performing numerous
computer experiments on different problems or by a simple well-directed theoretical 
analysis. A simple theory, of course, provides invaluable insight as well as the 
desired estimate. 

For linear programming using the simplex method, solid theoretical statements 
on the speed of convergence were elusive, because the method actually converges to 
an exact solution in a finite number of steps. The question is how many steps might be
required. This question was finally resolved when it was shown that it was possible 
for the number of steps to be exponential in the size of the program. The situation 
is different for interior point algorithms, which essentially treat the problem by 
introducing nonlinear terms, and which therefore do not generally obtain a solution 
in a finite number of steps but instead converge toward a solution. 

For nonlinear programs, including interior point methods applied to linear 
programs, it is meaningful to consider the speed of convergence. There are many
different classes of nonlinear programming algorithms, each with its own conver-
gence characteristics. However, in many cases the convergence properties can be
deduced analytically by fairly simple means, and this analysis is substantiated by 
computational experience. Presentation of convergence analysis, which seems to 
be the natural focal point of a theory directed at obtaining specific answers, is a 
unique feature of this book. 

There are in fact two aspects of convergence rate theory. The first is generally 
known as complexity analysis and focuses on how fast the method converges 
overall, distinguishing between polynomial time algorithms and non-polynomial 
time algorithms. The second aspect provides more detailed analysis of how fast 
the method converges in the final stages, and can provide comparisons between 
different algorithms. Both of these are treated in this book. 

The convergence rate theory presented has two somewhat surprising but definitely 
pleasing aspects. First, the theory is, for the most part, extremely simple in nature. 
Although initially one might fear that a theory aimed at predicting the speed of conver- 
gence of a complex algorithm might itself be doubly complex, in fact the associated 
convergence analysis often turns out to be exceedingly elementary, requiring only a 
line or two of calculation. Second, a large class of seemingly distinct algorithms turns 
out to have a common convergence rate. Indeed, as emphasized in the later chapters 
of the book, there is a canonical rate associated with a given programming problem 
that seems to govern the speed of convergence of many algorithms when applied to 
that problem. It is this fact that underlies the potency of the theory, allowing definitive 
comparisons among algorithms to be made even without detailed knowledge of the 
problems to which they will be applied. Together these two properties, simplicity and 
potency, assure convergence analysis a permanent position of major importance in 
mathematical programming theory. 
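
As a small numerical illustration of how a simple theoretical rate can match computation, the Python sketch below applies steepest descent with exact line search to a two-variable quadratic and compares the observed reduction in the objective per step with the classical ratio $((A - a)/(A + a))^2$, where $a$ and $A$ are the smallest and largest eigenvalues of the quadratic. The choice of method, the test problem, and the starting point are assumptions made for this example; the starting point is the well-known worst case, for which the ratio is attained at every step.

```python
def steepest_descent_quadratic(a, b, x0, iterations):
    """Minimize f(x) = 0.5*(a*x1^2 + b*x2^2) by steepest descent with exact
    line search; return the objective value at each iterate."""
    x1, x2 = x0
    values = [0.5 * (a * x1 ** 2 + b * x2 ** 2)]
    for _ in range(iterations):
        g1, g2 = a * x1, b * x2              # gradient of f
        denom = a * g1 ** 2 + b * g2 ** 2    # g^T Q g
        if denom == 0.0:
            break                            # already at the minimum
        alpha = (g1 ** 2 + g2 ** 2) / denom  # exact minimizing step length
        x1, x2 = x1 - alpha * g1, x2 - alpha * g2
        values.append(0.5 * (a * x1 ** 2 + b * x2 ** 2))
    return values


# Hypothetical problem: eigenvalues a = 1 and A = 10, started at (10, 1).
values = steepest_descent_quadratic(a=1.0, b=10.0, x0=(10.0, 1.0), iterations=5)
ratios = [values[k + 1] / values[k] for k in range(len(values) - 1)]
bound = ((10.0 - 1.0) / (10.0 + 1.0)) ** 2
print(ratios, bound)
```

Every observed ratio agrees with $(9/11)^2 \approx 0.669$, so a one-line formula predicts the behavior of the whole run.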


PART I
LINEAR 
PROGRAMMING 


Chapter 2 BASIC PROPERTIES
OF LINEAR 
PROGRAMS 


2.1 INTRODUCTION 


A linear program (LP) is an optimization problem in which the objective function 
is linear in the unknowns and the constraints consist of linear equalities and linear 
inequalities. The exact form of these constraints may differ from one problem 
to another, but as shown below, any linear program can be transformed into the 
following standard form: 


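\[
\begin{array}{ll}
\text{minimize} & c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \\
\text{subject to} & a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1 \\
& a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n = b_2 \\
& \qquad \vdots \\
& a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n = b_m \\
\text{and} & x_1 \geq 0, \; x_2 \geq 0, \; \ldots, \; x_n \geq 0,
\end{array}
\tag{1}
\]

where the $b_i$'s, $c_i$'s and $a_{ij}$'s are fixed real constants, and the $x_i$'s are real numbers
to be determined. We always assume that each equation has been multiplied by
minus unity, if necessary, so that each $b_i \geq 0$.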

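In more compact vector notation,* this standard problem becomes

\[
\begin{array}{ll}
\text{minimize} & \mathbf{c}^T \mathbf{x} \\
\text{subject to} & \mathbf{A}\mathbf{x} = \mathbf{b} \quad \text{and} \quad \mathbf{x} \geq \mathbf{0}.
\end{array}
\tag{2}
\]

Here $\mathbf{x}$ is an n-dimensional column vector, $\mathbf{c}^T$ is an n-dimensional row vector, $\mathbf{A}$ is
an $m \times n$ matrix, and $\mathbf{b}$ is an m-dimensional column vector. The vector inequality
$\mathbf{x} \geq \mathbf{0}$ means that each component of $\mathbf{x}$ is nonnegative.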


*See Appendix A for a description of the vector notation used throughout this book. 
Before giving some examples of areas in which linear programming problems
arise naturally, we indicate how various other forms of linear programs can be 
converted to the standard form. 


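Example 1 (Slack variables). Consider the problem

\[
\begin{array}{ll}
\text{minimize} & c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \\
\text{subject to} & a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n \leq b_1 \\
& a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n \leq b_2 \\
& \qquad \vdots \\
& a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n \leq b_m \\
\text{and} & x_1 \geq 0, \; x_2 \geq 0, \; \ldots, \; x_n \geq 0.
\end{array}
\]

In this case the constraint set is determined entirely by linear inequalities. The
problem may be alternatively expressed as

\[
\begin{array}{ll}
\text{minimize} & c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \\
\text{subject to} & a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n + y_1 = b_1 \\
& a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n + y_2 = b_2 \\
& \qquad \vdots \\
& a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n + y_m = b_m \\
\text{and} & x_1 \geq 0, \; x_2 \geq 0, \; \ldots, \; x_n \geq 0, \\
\text{and} & y_1 \geq 0, \; y_2 \geq 0, \; \ldots, \; y_m \geq 0.
\end{array}
\]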


The new nonnegative variables $y_i$ introduced to convert the inequalities to equalities
are called slack variables (or more loosely, slacks). By considering the problem
as one having n + m unknowns $x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_m$, the problem takes
the standard form. The m × (n + m) matrix that now describes the linear equality
constraints is of the special form [A, I] (that is, its columns can be partitioned into 
two sets; the first n columns make up the original A matrix and the last m columns 
make up an m x m identity matrix). 
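
The construction of [A, I] can be sketched in a few lines of Python. This is only a minimal illustration: the function name `add_slack_variables` and the sample matrix are assumptions made for the example, not part of the text.

```python
def add_slack_variables(A):
    """Return the rows of [A, I] for the inequality system A x <= b.

    A is a list of m rows with n entries each. One slack column is
    appended per constraint, so the result has n + m columns.
    """
    m = len(A)
    return [list(row) + [1 if k == i else 0 for k in range(m)]
            for i, row in enumerate(A)]


# Hypothetical data: three inequalities in two unknowns.
A = [[1, 2],
     [3, 1],
     [0, 4]]
for row in add_slack_variables(A):
    print(row)
```

The right-hand side b and the original cost coefficients are unchanged; the slack variables simply receive zero cost in the objective.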


Example 2 (Surplus variables). If the linear inequalities of Example 1 are reversed 
so that a typical inequality is 


$$a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n \ge b_i,$$

it is clear that this is equivalent to

$$a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n - y_i = b_i$$

with $y_i \ge 0$. Variables, such as $y_i$, adjoined in this fashion to convert a "greater than
or equal to" inequality to equality are called surplus variables.


It should be clear that by suitably multiplying by minus unity, and adjoining 
slack and surplus variables, any set of linear inequalities can be converted to 
standard form if the unknown variables are restricted to be nonnegative. 


Example 3 (Free variables—first method). If a linear program is given in standard 
form except that one or more of the unknown variables is not required to be 
nonnegative, the problem can be transformed to standard form by either of two 
simple techniques. 

To describe the first technique, suppose in (1), for example, that the restriction
$x_1 \ge 0$ is not present and hence $x_1$ is free to take on either positive or negative
values. We then write

$$x_1 = u_1 - v_1, \tag{3}$$

where we require $u_1 \ge 0$ and $v_1 \ge 0$. If we substitute $u_1 - v_1$ for $x_1$ everywhere in
(1), the linearity of the constraints is preserved and all variables are now required
to be nonnegative. The problem is then expressed in terms of the n + 1 variables
$u_1, v_1, x_2, x_3, \ldots, x_n$.

There is obviously a certain degree of redundancy introduced by this technique,
however, since a constant added to both $u_1$ and $v_1$ does not change $x_1$ (that is, the
representation of a given value $x_1$ is not unique). Nevertheless, this does not hinder
the simplex method of solution.
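
In matrix terms, the substitution keeps the column of the free variable for $u_1$ and adjoins its negative for $v_1$. The following Python sketch shows this; the function name `split_free_variable` and the sample data are illustrative assumptions.

```python
def split_free_variable(c, A, j=0):
    """Replace the free variable x_j by u - v with u >= 0 and v >= 0.

    The cost entry and the column of x_j are kept for u, and their
    negatives are inserted immediately afterwards for v.
    """
    c_new = c[:j] + [c[j], -c[j]] + c[j + 1:]
    A_new = [row[:j] + [row[j], -row[j]] + row[j + 1:] for row in A]
    return c_new, A_new


# Hypothetical data: two equality constraints, first variable free.
c = [1, 3, 4]
A = [[1, 2, 1],
     [2, 3, 1]]
c_new, A_new = split_free_variable(c, A, j=0)
print(c_new)   # costs for (u, v, x2, x3)
print(A_new)   # constraint rows for (u, v, x2, x3)
```

The redundancy noted above is visible here: the columns for u and v are negatives of each other, so adding the same constant to both leaves every constraint and the objective unchanged.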


Example 4 (Free variables—second method). A second approach for converting 
to standard form when $x_1$ is unconstrained in sign is to eliminate $x_1$ together with
one of the constraint equations. Take any one of the m equations in (1) which has
a nonzero coefficient for $x_1$. Say, for example,

$$a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n = b_i, \tag{4}$$

where $a_{i1} \ne 0$. Then $x_1$ can be expressed as a linear combination of the other
variables plus a constant. If this expression is substituted for $x_1$ everywhere in (1),
we are led to a new problem of exactly the same form but expressed in terms of
the variables $x_2, x_3, \ldots, x_n$ only. Furthermore, the ith equation, used to determine
$x_1$, is now identically zero and it too can be eliminated. This substitution scheme
is valid since any combination of nonnegative variables $x_2, x_3, \ldots, x_n$ leads to
a feasible $x_1$ from (4), since the sign of $x_1$ is unrestricted. As a result of this
simplification, we obtain a standard linear program having n − 1 variables and m − 1
constraint equations. The value of the variable $x_1$ can be determined after solution
through (4).

Example 5 (Specific case). As a specific instance of the above technique consider
the problem

$$
\begin{aligned}
\text{minimize}\quad & x_1 + 3x_2 + 4x_3 \\
\text{subject to}\quad & x_1 + 2x_2 + x_3 = 5 \\
& 2x_1 + 3x_2 + x_3 = 6 \\
& x_2 \ge 0,\ x_3 \ge 0.
\end{aligned}
$$

Since $x_1$ is free, we solve for it from the first constraint, obtaining

$$x_1 = 5 - 2x_2 - x_3. \tag{5}$$

Substituting this into the objective and the second constraint, we obtain the equivalent
problem (subtracting five from the objective)

$$
\begin{aligned}
\text{minimize}\quad & x_2 + 3x_3 \\
\text{subject to}\quad & x_2 + x_3 = 4 \\
& x_2 \ge 0,\ x_3 \ge 0,
\end{aligned}
$$

which is a problem in standard form. After the smaller problem is solved (the
answer is $x_2 = 4$, $x_3 = 0$) the value for $x_1$ ($x_1 = -3$) can be found from (5).
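
For readers who wish to check the substitution term by term, the computation can be written out as follows; it uses only the data of the example.

$$
\begin{aligned}
x_1 + 3x_2 + 4x_3 &= (5 - 2x_2 - x_3) + 3x_2 + 4x_3 = 5 + x_2 + 3x_3, \\
2x_1 + 3x_2 + x_3 &= 2(5 - 2x_2 - x_3) + 3x_2 + x_3 = 10 - x_2 - x_3 = 6
\;\Longrightarrow\; x_2 + x_3 = 4.
\end{aligned}
$$

On the reduced constraint the objective equals $x_2 + 3x_3 = 4 + 2x_3$, which is smallest at $x_3 = 0$, giving $x_2 = 4$. Then (5) yields $x_1 = 5 - 8 - 0 = -3$, and the original objective takes the value $-3 + 12 + 0 = 9 = 5 + 4$.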


2.2 EXAMPLES OF LINEAR PROGRAMMING 
PROBLEMS 


Linear programming has long proved its merit as a significant model of numerous 
allocation problems and economic phenomena. The continuously expanding liter- 
ature of applications repeatedly demonstrates the importance of linear programming 
as a general framework for problem formulation. In this section we present some 
classic examples of situations that have natural formulations. 


Example 1 (The diet problem). How can we determine the most economical diet 
that satisfies the basic minimum nutritional requirements for good health? Such a 
problem might, for example, be faced by the dietician of a large army. We assume 
that there are available at the market n different foods and that the jth food sells
at a price $c_j$ per unit. In addition there are m basic nutritional ingredients and, to
achieve a balanced diet, each individual must receive at least $b_i$ units of the ith
nutrient per day. Finally, we assume that each unit of food j contains $a_{ij}$ units of
the ith nutrient.

If we denote by $x_j$ the number of units of food j in the diet, the problem then
is to select the $x_j$'s to minimize the total cost

$$c_1 x_1 + c_2 x_2 + \cdots + c_n x_n$$

subject to the nutritional constraints

$$
\begin{aligned}
a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n &\ge b_1 \\
a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n &\ge b_2 \\
&\ \ \vdots \\
a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n &\ge b_m
\end{aligned}
$$

and the nonnegativity constraints

$$x_1 \ge 0,\ x_2 \ge 0,\ \ldots,\ x_n \ge 0$$

on the food quantities.

This problem can be converted to standard form by subtracting a nonnegative 
surplus variable from the left side of each of the m linear inequalities. The diet 
problem is discussed further in Chapter 4. 
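
As a small illustration of the surplus variables in this setting, the sketch below evaluates the cost of a candidate diet and the surplus of each nutrient. The function name `diet_cost_and_surplus` and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def diet_cost_and_surplus(c, A, b, x):
    """Return the total cost c^T x and the surplus vector A x - b.

    A diet x >= 0 meets the nutritional constraints exactly when
    every surplus entry is nonnegative.
    """
    cost = sum(cj * xj for cj, xj in zip(c, x))
    surplus = [sum(aij * xj for aij, xj in zip(row, x)) - bi
               for row, bi in zip(A, b)]
    return cost, surplus


# Hypothetical data: two foods, two nutrients.
c = [2, 3]            # price per unit of each food
A = [[1, 2],          # units of nutrient 1 per unit of each food
     [3, 1]]          # units of nutrient 2 per unit of each food
b = [4, 6]            # minimum daily requirements
cost, surplus = diet_cost_and_surplus(c, A, b, [2, 1])
print(cost, surplus)
```

The surplus values are exactly the values the surplus variables take in the standard-form version of the problem.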


Example 2 (The transportation problem). Quantities $a_1, a_2, \ldots, a_m$, respectively,
of a certain product are to be shipped from each of m locations and received in
amounts $b_1, b_2, \ldots, b_n$, respectively, at each of n destinations. Associated with the
shipping of a unit of product from origin i to destination j is a unit shipping
cost $c_{ij}$. It is desired to determine the amounts $x_{ij}$ to be shipped between each
origin-destination pair i = 1, 2, ..., m; j = 1, 2, ..., n, so as to satisfy the shipping
requirements and minimize the total cost of transportation.


To formulate this problem as a linear programming problem, we set up the 
array shown below: 


$$
\begin{array}{cccc|c}
x_{11} & x_{12} & \cdots & x_{1n} & a_1 \\
x_{21} & x_{22} & \cdots & x_{2n} & a_2 \\
\vdots & \vdots &        & \vdots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn} & a_m \\
\hline
b_1 & b_2 & \cdots & b_n &
\end{array}
$$

The ith row in this array defines the variables associated with the ith origin, while
the jth column in this array defines the variables associated with the jth destination.
The problem is to place nonnegative variables $x_{ij}$ in this array so that the sum
across the ith row is $a_i$, the sum down the jth column is $b_j$, and the weighted sum
$\sum_{j=1}^{n} \sum_{i=1}^{m} c_{ij} x_{ij}$, representing the transportation cost, is minimized.
Thus, we have the linear programming problem:

$$
\begin{aligned}
\text{minimize}\quad & \sum_{i,j} c_{ij} x_{ij} \\
\text{subject to}\quad & \sum_{j=1}^{n} x_{ij} = a_i \quad \text{for } i = 1, 2, \ldots, m & (6) \\
& \sum_{i=1}^{m} x_{ij} = b_j \quad \text{for } j = 1, 2, \ldots, n & (7) \\
& x_{ij} \ge 0 \quad \text{for } i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, n.
\end{aligned}
$$

In order that the constraints (6), (7) be consistent, we must, of course, assume
that $\sum_{i=1}^{m} a_i = \sum_{j=1}^{n} b_j$, which corresponds to assuming that the total amount
shipped is equal to the total amount received.

The transportation problem is now clearly seen to be a linear programming
problem in mn variables. The equations (6), (7) can be combined and expressed in
matrix form in the usual manner and this results in an (m + n) × (mn) coefficient
matrix consisting of zeros and ones only.
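
The structure of this coefficient matrix is easy to see by building it explicitly. The sketch below is one possible construction; the function name `transportation_matrix` and the ordering of the variables (row by row through the array) are assumptions made for the illustration.

```python
def transportation_matrix(m, n):
    """Build the (m + n) x (mn) coefficient matrix of equations (6), (7).

    Variable x_ij (i, j counted from 0) is assumed to occupy column i*n + j.
    The first m rows are the origin equations (6); the last n rows are
    the destination equations (7).
    """
    rows = []
    for i in range(m):
        rows.append([1 if col // n == i else 0 for col in range(m * n)])
    for j in range(n):
        rows.append([1 if col % n == j else 0 for col in range(m * n)])
    return rows


M = transportation_matrix(2, 3)
for row in M:
    print(row)
```

Each column contains exactly two ones, one in an origin row and one in a destination row, because each $x_{ij}$ appears once in (6) and once in (7).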


Example 3 (Manufacturing problem). Suppose we own a facility that is capable 
of engaging in n different production activities, each of which produces various 
amounts of m commodities. Each activity can be operated at any level $x_i \ge 0$ but
when operated at the unity level the ith activity costs $c_i$ dollars and yields $a_{ji}$ units
of the jth commodity. Assuming linearity of the production facility, if we are given
a set of m numbers $b_1, b_2, \ldots, b_m$ describing the output requirements of the m
commodities, and we wish to produce these at minimum cost, ours is the linear
program (1).


Example 4 (A warehousing problem). Consider the problem of operating a 
warehouse, by buying and selling the stock of a certain commodity, in order 
to maximize profit over a certain length of time. The warehouse has a fixed 
capacity C, and there is a cost r per unit for holding stock for one period. The 
price of the commodity is known to fluctuate over a number of time periods— 
say months. In any period the same price holds for both purchase and sale. The
warehouse is originally empty and is required to be empty at the end of the last 
period. 

To formulate this problem, variables are introduced for each time period. In
particular, let $x_i$ denote the level of stock in the warehouse at the beginning of
period i. Let $u_i$ denote the amount bought during period i, and let $s_i$ denote the
amount sold during period i. If there are n periods, the problem is

$$
\begin{aligned}
\text{maximize}\quad & \sum_{i=1}^{n} (p_i s_i - r x_i) \\
\text{subject to}\quad & x_{i+1} = x_i + u_i - s_i, && i = 1, 2, \ldots, n-1 \\
& 0 = x_n + u_n - s_n \\
& x_i + z_i = C, && i = 2, \ldots, n \\
& x_1 = 0,\ x_i \ge 0,\ u_i \ge 0,\ s_i \ge 0,\ z_i \ge 0.
\end{aligned}
$$

Here $p_i$ is the price in period i and $z_i$ is a slack variable measuring the unused
capacity in period i.

If the constraints are written out explicitly for the case n = 3, they take the form

$$
\begin{aligned}
-u_1 + s_1 + x_2 &= 0 \\
-x_2 - u_2 + s_2 + x_3 &= 0 \\
x_2 + z_2 &= C \\
-x_3 - u_3 + s_3 &= 0 \\
x_3 + z_3 &= C.
\end{aligned}
$$


Note that the coefficient matrix can be partitioned into blocks corresponding to the 
variables of the different time periods. The only blocks that have nonzero entries 
are the diagonal ones and the ones immediately above the diagonal. This structure 
is typical of problems involving time. 


Example 5 (Support Vector Machines). Suppose several d-dimensional data points
are classified into two distinct classes. For example, two-dimensional data points may
be grade averages in science and humanities for different students. We also know
the academic major of each student, as being in science or humanities, which serves
as the classification. In general we have vectors $a_i \in E^d$ for $i = 1, 2, \ldots, n_1$, and
vectors $b_j \in E^d$ for $j = 1, 2, \ldots, n_2$. We wish to find a hyperplane that separates the
$a_i$'s from the $b_j$'s. Mathematically we wish to find $y \in E^d$ and a number $\beta$ such that

$$
\begin{aligned}
a_i^T y + \beta &\ge 1 \quad \text{for all } i \\
b_j^T y + \beta &\le -1 \quad \text{for all } j,
\end{aligned}
$$

where $\{x : x^T y + \beta = 0\}$ is the desired hyperplane, and the separation is defined by
the +1 and −1. This is a linear program. See Fig. 2.1.


[Fig. 2.1 Support vector for data classification]

Example 6 (Combinatorial Auction). Suppose there are m mutually exclusive
potential states and only one of them will be true at maturity. For example, the states
may correspond to the winning horse in a race of m horses, or the value of a stock
index falling within one of m intervals. An auction organizer who establishes a parimutuel
auction is prepared to issue contracts specifying subsets of the m possibilities
that pay $1 if the final state is one of those designated by the contract, and zero
otherwise. There are n participants who may place orders with the organizer for
the purchase of such contracts. An order by the jth participant consists of a vector
$a_j = (a_{1j}, a_{2j}, \ldots, a_{mj})^T$ where each component is either 0 or 1, a one indicating a
desire to be paid if the corresponding state occurs.

Accompanying the order is a number $\pi_j$ which is the price limit the participant
is willing to pay for one unit of the order. Finally, the participant also declares the
maximum number $q_j$ of units he or she is willing to accept under these terms.

The auction organizer, after receiving these various orders, must decide how
many contracts to fill. Let $x_j$ be the number of units awarded to the jth order. Then
the jth participant will pay $\pi_j x_j$. The total amount paid by all participants is $\pi^T x$,
where x is the vector of $x_j$'s and $\pi$ is the vector of prices.

If the outcome is the ith state, the auction organizer must pay out a total
of $\sum_{j=1}^{n} a_{ij} x_j = (Ax)_i$. The organizer would like to maximize profit in the worst
possible case, and does this by solving the problem

$$
\begin{aligned}
\text{maximize}\quad & \pi^T x - \max_i (Ax)_i \\
\text{subject to}\quad & x \le q \\
& x \ge 0.
\end{aligned}
$$

This problem can be expressed alternatively as selecting x and s to

$$
\begin{aligned}
\text{maximize}\quad & \pi^T x - s \\
\text{subject to}\quad & Ax - \mathbf{1}s \le 0 \\
& x \le q \\
& x \ge 0,
\end{aligned}
$$

where $\mathbf{1}$ is the vector of all 1's. Notice that the profit will always be nonnegative,
since x = 0 is feasible.
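
The worst-case profit that the organizer maximizes can be evaluated directly for any candidate award vector. The following sketch does this with exact rational arithmetic; the function name `worst_case_profit` and the two-order example are hypothetical.

```python
from fractions import Fraction


def worst_case_profit(prices, A, x):
    """Return pi^T x - max_i (A x)_i for an award vector x.

    A has one row per state and one column per order; entry A[i][j]
    is 1 if order j pays out in state i and 0 otherwise.
    """
    revenue = sum(p * xj for p, xj in zip(prices, x))
    payouts = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    return revenue - max(payouts)


# Hypothetical auction: three states, two orders.
# Order 1 pays in states 1 and 2; order 2 pays in state 3.
A = [[1, 0],
     [1, 0],
     [0, 1]]
prices = [Fraction(3, 5), Fraction(1, 2)]
print(worst_case_profit(prices, A, [2, 2]))
print(worst_case_profit(prices, A, [0, 0]))
```

The smallest value of s that satisfies $Ax - \mathbf{1}s \le 0$ is exactly the largest payout, which is why the second formulation above is equivalent to the first.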


2.3 BASIC SOLUTIONS 


Consider the system of equalities 


Ax =b, (8) 


where x is an n-vector, b an m-vector, and A is an m × n matrix. Suppose that
from the n columns of A we select a set of m linearly independent columns
(such a set exists if the rank of A is m). For notational simplicity assume that we
select the first m columns of A and denote the m × m matrix determined by these
columns by B. The matrix B is then nonsingular and we may uniquely solve the
equation

$$B x_B = b \tag{9}$$

for the m-vector $x_B$. By putting $x = (x_B, 0)$ (that is, setting the first m components
of x equal to those of $x_B$ and the remaining components equal to zero), we obtain
a solution to Ax = b. This leads to the following definition.


Definition. Given the set of m simultaneous linear equations in n unknowns 
(8), let B be any nonsingular m x m submatrix made up of columns of A. 
Then, if all n —m components of x not associated with columns of B are set 
equal to zero, the solution to the resulting set of equations is said to be a basic 
solution to (8) with respect to the basis B. The components of x associated 
with columns of B are called basic variables. 


In the above definition we refer to B as a basis, since B consists of m linearly
independent columns that can be regarded as a basis for the space $E^m$. The basic
solution corresponds to an expression for the vector b as a linear combination of 
these basis vectors. This interpretation is discussed further in the next section. 

In general, of course, Eq. (8) may have no basic solutions. However, we may 
avoid trivialities and difficulties of a nonessential nature by making certain elementary 
assumptions regarding the structure of the matrix A. First, we usually assume that
n > m, that is, the number of variables $x_i$ exceeds the number of equality constraints.
Second, we usually assume that the rows of A are linearly independent, corresponding 
to linear independence of the m equations. A linear dependency among the rows of 
A would lead either to contradictory constraints and hence no solutions to (8), or to 
a redundancy that could be eliminated. Formally, we explicitly make the following 
assumption in our development, unless noted otherwise. 


Full rank assumption. The m x n matrix A has m <n, and the m rows of A 
are linearly independent. 


Under the above assumption, the system (8) will always have a solution and,
in fact, it will always have at least one basic solution.

The basic variables in a basic solution are not necessarily all nonzero. This is
noted by the following definition.


Definition. If one or more of the basic variables in a basic solution has value 
zero, that solution is said to be a degenerate basic solution. 


We note that in a nondegenerate basic solution the basic variables, and hence 
the basis B, can be immediately identified from the positive components of the 
solution. There is ambiguity associated with a degenerate basic solution, however, 
since the zero-valued basic and nonbasic variables can be interchanged. 

So far in the discussion of basic solutions we have treated only the equality 
constraint (8) and have made no reference to positivity constraints on the variables. 
Similar definitions apply when these constraints are also considered. Thus, consider 
now the system of constraints 


$$
\begin{aligned}
Ax &= b \\
x &\ge 0,
\end{aligned}
\tag{10}
$$

which represent the constraints of a linear program in standard form.


Definition. A vector x satisfying (10) is said to be feasible for these 
constraints. A feasible solution to the constraints (10) that is also basic is said to 
be a basic feasible solution; if this solution is also a degenerate basic solution, 
it is called a degenerate basic feasible solution. 


2.4 THE FUNDAMENTAL THEOREM OF LINEAR 
PROGRAMMING 


In this section, through the fundamental theorem of linear programming, we 
establish the primary importance of basic feasible solutions in solving linear 
programs. The method of proof of the theorem is in many respects as important as 
the result itself, since it represents the beginning of the development of the simplex 
method. The theorem itself shows that it is necessary only to consider basic feasible 
solutions when seeking an optimal solution to a linear program because the optimal 
value is always achieved at such a solution.

Corresponding to a linear program in standard form

$$
\begin{aligned}
\text{minimize}\quad & c^T x \\
\text{subject to}\quad & Ax = b \\
& x \ge 0,
\end{aligned}
\tag{11}
$$


a feasible solution to the constraints that achieves the minimum value of the 
objective function subject to those constraints is said to be an optimal feasible 
solution. If this solution is basic, it is an optimal basic feasible solution. 


Fundamental theorem of linear programming. Given a linear program in
standard form (11) where A is an m × n matrix of rank m,

i) if there is a feasible solution, there is a basic feasible solution;
ii) if there is an optimal feasible solution, there is an optimal basic feasible
solution.


Proof of (i). Denote the columns of A by $a_1, a_2, \ldots, a_n$. Suppose $x = (x_1, x_2, \ldots, x_n)$
is a feasible solution. Then, in terms of the columns of A, this solution satisfies:

$$x_1 a_1 + x_2 a_2 + \cdots + x_n a_n = b.$$

Assume that exactly p of the variables $x_i$ are greater than zero, and for convenience,
that they are the first p variables. Thus

$$x_1 a_1 + x_2 a_2 + \cdots + x_p a_p = b. \tag{12}$$

There are now two cases, corresponding as to whether the set $a_1, a_2, \ldots, a_p$ is
linearly independent or linearly dependent.


CASE 1: Assume $a_1, a_2, \ldots, a_p$ are linearly independent. Then clearly, $p \le m$. If
p = m, the solution is basic and the proof is complete. If p < m, then, since A has rank
m, m − p vectors can be found from the remaining n − p vectors so that the resulting
set of m vectors is linearly independent. (See Exercise 12.) Assigning the value zero
to the corresponding m − p variables yields a (degenerate) basic feasible solution.


CASE 2: Assume $a_1, a_2, \ldots, a_p$ are linearly dependent. Then there is a nontrivial
linear combination of these vectors that is zero. Thus there are constants
$y_1, y_2, \ldots, y_p$, at least one of which can be assumed to be positive, such that

$$y_1 a_1 + y_2 a_2 + \cdots + y_p a_p = 0. \tag{13}$$

Multiplying this equation by a scalar $\varepsilon$ and subtracting it from (12), we obtain

$$(x_1 - \varepsilon y_1) a_1 + (x_2 - \varepsilon y_2) a_2 + \cdots + (x_p - \varepsilon y_p) a_p = b. \tag{14}$$

This equation holds for every $\varepsilon$, and for each $\varepsilon$ the components $x_i - \varepsilon y_i$ correspond
to a solution of the linear equalities, although they may violate $x_i - \varepsilon y_i \ge 0$.
Denoting $y = (y_1, y_2, \ldots, y_p, 0, 0, \ldots, 0)$, we see that for any $\varepsilon$

$$x - \varepsilon y \tag{15}$$

is a solution to the equalities. For $\varepsilon = 0$, this reduces to the original feasible solution.
As $\varepsilon$ is increased from zero, the various components increase, decrease, or remain
constant, depending upon whether the corresponding $y_i$ is negative, positive, or zero.
Since we assume at least one $y_i$ is positive, at least one component will decrease
as $\varepsilon$ is increased. We increase $\varepsilon$ to the first point where one or more components
become zero. Specifically, we set

$$\varepsilon = \min \{ x_i / y_i : y_i > 0 \}.$$

For this value of $\varepsilon$ the solution given by (15) is feasible and has at most p − 1
positive variables. Repeating this process if necessary, we can eliminate positive
variables until we have a feasible solution with corresponding columns that are
linearly independent. At that point Case 1 applies. ∎
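
One step of the Case 2 reduction can be sketched as follows. The function name `reduce_positive_components` is an assumption, as is the small system $x_1 + x_2 + x_3 = 1$, $2x_1 + 3x_2 = 1$ used as test data; the vector y is supplied by hand rather than computed.

```python
from fractions import Fraction


def reduce_positive_components(x, y):
    """Perform one step of Case 2.

    x is a feasible solution, y satisfies A y = 0 with at least one
    positive entry and y_i = 0 wherever x_i = 0. Returns the step
    eps = min{x_i / y_i : y_i > 0} and the new point x - eps * y.
    """
    eps = min(xi / yi for xi, yi in zip(x, y) if yi > 0)
    return eps, [xi - eps * yi for xi, yi in zip(x, y)]


# Hypothetical data: a feasible point with three positive components.
A = [[1, 1, 1],
     [2, 3, 0]]
b = [1, 1]
x = [Fraction(1, 4), Fraction(1, 6), Fraction(7, 12)]
y = [3, -2, -1]          # chosen so that A y = 0 and y_1 > 0

eps, x_new = reduce_positive_components(x, y)
print(eps, x_new)
```

In this instance the new point has one fewer positive component and still satisfies both equations, as the argument in the proof predicts.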


Proof of (ii). Let $x = (x_1, x_2, \ldots, x_n)$ be an optimal feasible solution and, as in
the proof of (i) above, suppose there are exactly p positive variables $x_1, x_2, \ldots, x_p$.
Again there are two cases; and Case 1, corresponding to linear independence, is
exactly the same as before.

Case 2 also goes exactly the same as before, but it must be shown that for any $\varepsilon$
the solution (15) is optimal. To show this, note that the value of the solution $x - \varepsilon y$ is

$$c^T x - \varepsilon c^T y. \tag{16}$$

For $\varepsilon$ sufficiently small in magnitude, $x - \varepsilon y$ is a feasible solution for positive or
negative values of $\varepsilon$. Thus we conclude that $c^T y = 0$. For, if $c^T y \ne 0$, an $\varepsilon$ of small
magnitude and proper sign could be determined so as to render (16) smaller than
$c^T x$ while maintaining feasibility. This would violate the assumption of optimality
of x and hence we must have $c^T y = 0$.

Having established that the new feasible solution with fewer positive components
is also optimal, the remainder of the proof may be completed exactly as in
part (i). ∎


This theorem reduces the task of solving a linear program to that of searching
over basic feasible solutions. Since for a problem having n variables and m
constraints there are at most

$$\binom{n}{m} = \frac{n!}{m!\,(n-m)!}$$

basic solutions (corresponding to the number of ways of selecting m of n columns),
there are only a finite number of possibilities. Thus the fundamental theorem yields 
an obvious, but terribly inefficient, finite search technique. By expanding upon the 
technique of proof as well as the statement of the fundamental theorem, the efficient 
simplex procedure is derived. 
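
To give a feeling for why exhaustive search is impractical, the bound can be tabulated for a few problem sizes. The sketch below uses `math.comb` from the Python standard library (available in Python 3.8 and later); the function name `max_basic_solutions` and the chosen sizes are illustrative.

```python
from math import comb


def max_basic_solutions(n, m):
    """Upper bound on the number of basic solutions: n choose m."""
    return comb(n, m)


for n, m in [(3, 2), (10, 5), (50, 25), (100, 50)]:
    print(n, m, max_basic_solutions(n, m))
```

The count grows so rapidly with problem size that enumerating every choice of m columns quickly becomes infeasible, even though the number is always finite.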

It should be noted that the proof of the fundamental theorem given above is of 
a simple algebraic character. In the next section the geometric interpretation of this 
theorem is explored in terms of the general theory of convex sets. Although the 
geometric interpretation is aesthetically pleasing and theoretically important, the 
reader should bear in mind, lest one be diverted by the somewhat more advanced 
arguments employed, the underlying elementary level of the fundamental theorem. 


2.5 RELATIONS TO CONVEXITY 


Our development to this point, including the above proof of the fundamental 
theorem, has been based only on elementary properties of systems of linear 
equations. These results, however, have interesting interpretations in terms of the
theory of convex sets that can lead not only to an alternative derivation of the fundamental
theorem, but also to a clearer geometric understanding of the result. The main
link between the algebraic and geometric theories is the formal relation between 
basic feasible solutions of linear inequalities in standard form and extreme points 
of polytopes. We establish this correspondence as follows. The reader is referred 
to Appendix B for a more complete summary of concepts related to convexity, but 
the definition of an extreme point is stated here. 


Definition. A point x in a convex set C is said to be an extreme point of C
if there are no two distinct points $x_1$ and $x_2$ in C such that $x = \alpha x_1 + (1 - \alpha) x_2$
for some $\alpha$, $0 < \alpha < 1$.


An extreme point is thus a point that does not lie strictly within a line segment 
connecting two other points of the set. The extreme points of a triangle, for example, 
are its three vertices. 


Theorem. (Equivalence of extreme points and basic solutions). Let A be an
m × n matrix of rank m and b an m-vector. Let K be the convex polytope
consisting of all n-vectors x satisfying

$$
\begin{aligned}
Ax &= b \\
x &\ge 0.
\end{aligned}
\tag{17}
$$

A vector x is an extreme point of K if and only if x is a basic feasible solution
to (17).

Proof. Suppose first that $x = (x_1, x_2, \ldots, x_m, 0, 0, \ldots, 0)$ is a basic feasible
solution to (17). Then

$$x_1 a_1 + x_2 a_2 + \cdots + x_m a_m = b,$$

where $a_1, a_2, \ldots, a_m$, the first m columns of A, are linearly independent. Suppose
that x could be expressed as a convex combination of two other points in K; say,
$x = \alpha y + (1 - \alpha) z$, $0 < \alpha < 1$, $y \ne z$. Since all components of x, y, z are nonnegative
and since $0 < \alpha < 1$, it follows immediately that the last n − m components of y
and z are zero. Thus, in particular, we have

$$y_1 a_1 + y_2 a_2 + \cdots + y_m a_m = b$$

and

$$z_1 a_1 + z_2 a_2 + \cdots + z_m a_m = b.$$

Since the vectors $a_1, a_2, \ldots, a_m$ are linearly independent, however, it follows that
x = y = z and hence x is an extreme point of K.

Conversely, assume that x is an extreme point of K. Let us assume that the
nonzero components of x are the first k components. Then

$$x_1 a_1 + x_2 a_2 + \cdots + x_k a_k = b,$$

with $x_i > 0$, i = 1, 2, ..., k. To show that x is a basic feasible solution it must be
shown that the vectors $a_1, a_2, \ldots, a_k$ are linearly independent. We do this by contradiction.
Suppose $a_1, a_2, \ldots, a_k$ are linearly dependent. Then there is a nontrivial
linear combination that is zero:

$$y_1 a_1 + y_2 a_2 + \cdots + y_k a_k = 0.$$

Define the n-vector $y = (y_1, y_2, \ldots, y_k, 0, 0, \ldots, 0)$. Since $x_i > 0$, $1 \le i \le k$, it is
possible to select $\varepsilon > 0$ such that

$$x + \varepsilon y \ge 0, \qquad x - \varepsilon y \ge 0.$$

We then have $x = \tfrac{1}{2}(x + \varepsilon y) + \tfrac{1}{2}(x - \varepsilon y)$, which expresses x as a convex combination
of two distinct vectors in K. This cannot occur, since x is an extreme point of
K. Thus $a_1, a_2, \ldots, a_k$ are linearly independent and x is a basic feasible solution.
(Although if k < m, it is a degenerate basic feasible solution.) ∎


This correspondence between extreme points and basic feasible solutions 
enables us to prove certain geometric properties of the convex polytope K defining 
the constraint set of a linear programming problem. 


Corollary 1. If the convex set K corresponding to (17) is nonempty, it has at 
least one extreme point. 


Proof. This follows from the first part of the Fundamental Theorem and the
Equivalence Theorem above. ∎


Corollary 2. If there is a finite optimal solution to a linear programming 
problem, there is a finite optimal solution which is an extreme point of the 
constraint set. 


Corollary 3. The constraint set K corresponding to (17) possesses at most a 
finite number of extreme points. 


Proof. There are obviously only a finite number of basic solutions obtained by
selecting m basis vectors from the n columns of A. The extreme points of K are a
subset of these basic solutions. ∎


Finally, we come to the special case which occurs most frequently in practice 
and which in some sense is characteristic of well-formulated linear programs— 
the case where the constraint set K is nonempty and bounded. In this case we 
combine the results of the Equivalence Theorem and Corollary 3 above to obtain 
the following corollary. 


Corollary 4. If the convex polytope K corresponding to (17) is bounded, 
then K is a convex polyhedron, that is, K consists of points that are convex 
combinations of a finite number of points. 


Some of these results are illustrated by the following examples:

Example 1. Consider the constraint set in $E^3$ defined by

$$
\begin{aligned}
& x_1 + x_2 + x_3 = 1 \\
& x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0.
\end{aligned}
$$

This set is illustrated in Fig. 2.2. It has three extreme points, corresponding to the
three basic solutions to $x_1 + x_2 + x_3 = 1$.

[Fig. 2.2 Feasible set for Example 1]


Example 2. Consider the constraint set in $E^3$ defined by

$$
\begin{aligned}
& x_1 + x_2 + x_3 = 1 \\
& 2x_1 + 3x_2 = 1 \\
& x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0.
\end{aligned}
$$

This set is illustrated in Fig. 2.3. It has two extreme points, corresponding to the
two basic feasible solutions. Note that the system of equations itself has three basic
solutions, (2, −1, 0), (1/2, 0, 1/2), (0, 1/3, 2/3), the first of which is not feasible.
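
The basic solutions of a small system such as this one can be enumerated mechanically by trying every choice of m columns. The sketch below does so with exact rational arithmetic; the helper names `solve_square` and `basic_solutions` are assumptions, and the approach is plain enumeration, not an efficient method.

```python
from fractions import Fraction
from itertools import combinations


def solve_square(B, b):
    """Solve B z = b exactly by Gauss-Jordan elimination.

    Returns None when B is singular.
    """
    m = len(B)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)]
         for row, rhs in zip(B, b)]
    for col in range(m):
        pivot = next((r for r in range(col, m) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(m):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][m] for r in range(m)]


def basic_solutions(A, b):
    """Return (columns, x, is_feasible) for every nonsingular basis."""
    m, n = len(A), len(A[0])
    found = []
    for cols in combinations(range(n), m):
        B = [[A[r][c] for c in cols] for r in range(m)]
        z = solve_square(B, b)
        if z is None:
            continue
        x = [Fraction(0)] * n
        for c, value in zip(cols, z):
            x[c] = value
        found.append((cols, x, all(v >= 0 for v in x)))
    return found


# The equality system x1 + x2 + x3 = 1, 2*x1 + 3*x2 = 1.
A = [[1, 1, 1],
     [2, 3, 0]]
b = [1, 1]
for cols, x, feasible in basic_solutions(A, b):
    print(cols, x, feasible)
```

Only the basic solutions flagged as feasible correspond to extreme points of the constraint set.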


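[Fig. 2.3 Feasible set for Example 2]

Example 3. Consider the constraint set in $E^2$ defined in terms of the inequalities

$$
\begin{aligned}
x_1 + \tfrac{8}{3} x_2 &\le 4 \\
x_1 + x_2 &\le 2 \\
2x_1 &\le 3 \\
x_1 \ge 0,\ x_2 &\ge 0.
\end{aligned}
$$

This set is illustrated in Fig. 2.4. We see by inspection that this set has five extreme
points. In order to compare this example with our general results we must introduce
slack variables to yield the equivalent set in $E^5$:

$$
\begin{aligned}
x_1 + \tfrac{8}{3} x_2 + x_3 &= 4 \\
x_1 + x_2 + x_4 &= 2 \\
2x_1 + x_5 &= 3 \\
x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0,\ x_4 \ge 0,\ x_5 &\ge 0.
\end{aligned}
$$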

A basic solution for this system is obtained by setting any two variables to zero and 
solving for the remaining three. As indicated in Fig. 2.4, each edge of the figure 
corresponds to one variable being zero, and the extreme points are the points where 
two variables are zero. 

The last example illustrates that even when not expressed in standard form the
extreme points of the set defined by the constraints of a linear program correspond
to the possible solution points. This can be illustrated more directly by including the
objective function in the figure as well. Suppose, for example, that in Example 3
the objective function to be minimized is $-2x_1 - x_2$. The set of points satisfying
$-2x_1 - x_2 = z$ for fixed z is a line. As z varies, different parallel lines are obtained
as shown in Fig. 2.5. The optimal value of the linear program is the smallest value
of z for which the corresponding line has a point in common with the feasible set.
It should be reasonably clear, at least in two dimensions, that the points of solution
will always include an extreme point. In the figure this occurs at the point (3/2,
1/2) with $z = -3\tfrac{1}{2}$.
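
The claim can be checked numerically by listing the extreme points and evaluating the objective at each. The sketch below assumes the constraints of Example 3 read $x_1 + \tfrac{8}{3}x_2 \le 4$, $x_1 + x_2 \le 2$, $2x_1 \le 3$, $x_1 \ge 0$, $x_2 \ge 0$; that reading, the helper name `extreme_points`, and the brute-force approach (intersect every pair of boundary lines and keep the feasible intersections) are all assumptions of this illustration.

```python
from fractions import Fraction as F
from itertools import combinations

# Each constraint is stored as (a1, a2, rhs), meaning a1*x1 + a2*x2 <= rhs.
constraints = [
    (F(1), F(8, 3), F(4)),
    (F(1), F(1), F(2)),
    (F(2), F(0), F(3)),
    (F(-1), F(0), F(0)),   # x1 >= 0
    (F(0), F(-1), F(0)),   # x2 >= 0
]


def extreme_points(constraints):
    """Intersect every pair of boundary lines; keep the feasible points."""
    points = []
    for (a1, a2, r), (b1, b2, s) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue          # parallel boundary lines
        # Cramer's rule for the 2 x 2 system.
        p = ((r * b2 - a2 * s) / det, (a1 * s - r * b1) / det)
        feasible = all(c1 * p[0] + c2 * p[1] <= rhs
                       for c1, c2, rhs in constraints)
        if feasible and p not in points:
            points.append(p)
    return points


def objective(p):
    """The objective -2*x1 - x2 from the text."""
    return -2 * p[0] - p[1]


points = extreme_points(constraints)
best = min(points, key=objective)
print(points)
print(best, objective(best))
```

Under the stated assumptions the enumeration finds five extreme points, and the smallest objective value among them is attained at (3/2, 1/2).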
[Fig. 2.4 Feasible set for Example 3]

[Fig. 2.5 Illustration of extreme point solution, showing the lines z = −3, z = −2, z = −1]


2.6 EXERCISES


1. Convert the following problems to standard form: 


a) minimize x+2y+3z 


subject to 


b) minimize x + y + z
subject to x + 2y + 3z = 10
x ≥ 1, y ≥ 2, z ≥ 1.


2. A manufacturer wishes to produce an alloy that is, by weight, 30% metal A and 70% 
metal B. Five alloys are available at various prices as indicated below: 


Alloy        1     2     3     4     5
% A         10    25    50    75    95
% B         90    75    50    25     5
Price/lb    $5    $4    $3    $2    $1.50


The desired alloy will be produced by combining some of the other alloys. The 
manufacturer wishes to find the amounts of the various alloys needed and to determine 
the least expensive combination. Formulate this problem as a linear program. 


3. An oil refinery has two sources of crude oil: a light crude that costs $35/barrel and a 
heavy crude that costs $30/barrel. The refinery produces gasoline, heating oil, and jet 
fuel from crude in the amounts per barrel indicated in the following table: 


               Gasoline    Heating oil    Jet fuel
Light crude      0.3          0.2           0.3
Heavy crude      0.3          0.4           0.2


The refinery has contracted to supply 900,000 barrels of gasoline, 800,000 barrels of 
heating oil, and 500,000 barrels of jet fuel. The refinery wishes to find the amounts of 
light and heavy crude to purchase so as to be able to meet its obligations at minimum 
cost. Formulate this problem as a linear program. 


4. A small firm specializes in making five types of spare automobile parts. Each part is 
first cast from iron in the casting shop and then sent to the finishing shop where holes 
are drilled, surfaces are turned, and edges are ground. The required worker-hours (per 
100 units) for each of the parts in the two shops are shown below:

Part         1    2    3    4    5
Casting      2    1    3    3    1
Finishing    3    2    2    1    1

The profits from the parts are $30, $20, $40, $25, and $10 (per 100 units), respectively. 
The capacities of the casting and finishing shops over the next month are 700 and 1000 
worker-hours, respectively. Formulate the problem of determining the quantities of each 
spare part to be made during the month so as to maximize profit. 


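5. Convert the following problem to standard form and solve:

$$
\begin{aligned}
\text{maximize}\quad & x_1 + 4x_2 + x_3 \\
\text{subject to}\quad & 2x_1 - 2x_2 + x_3 = 4 \\
& x_1 - x_3 = 1 \\
& x_2 \ge 0,\ x_3 \ge 0.
\end{aligned}
$$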


6. A large textile firm has two manufacturing plants, two sources of raw material, and three
market centers. The transportation costs between the sources and the plants and between
the plants and the markets are as follows:

                    Plant
                A           B
Source 1     $1/ton     $1.50/ton
Source 2     $2/ton     $1.50/ton

                    Market
                1          2          3
Plant A      $4/ton     $2/ton     $1/ton
Plant B      $3/ton     $4/ton     $2/ton

Ten tons are available from source 1 and 15 tons from source 2. The three market centers
require 8 tons, 14 tons, and 3 tons. The plants have unlimited processing capacity.


a) Formulate the problem of finding the shipping patterns from sources to plants to 
markets that minimizes the total transportation cost. 

b) Reduce the problem to a single standard transportation problem with two sources and 
three destinations. (Hint: Find minimum cost paths from sources to markets.) 

c) Suppose that plant A has a processing capacity of 8 tons, and plant B has a processing
capacity of 7 tons. Show how to reduce the problem to two separate standard
transportation problems.


7. A businessman is considering an investment project. The project has a lifetime of four
years, with cash flows of -$100,000, +$50,000, +$70,000, and +$30,000 in each
of the four years, respectively. At any time he may borrow funds at the rates of 12%,
22%, and 34% (total) for 1, 2, or 3 periods, respectively. He may loan funds at 10% per
period. He calculates the present value of a project as the maximum amount of money 
he would pay now, to another party, for the project, assuming that he has no cash on 
hand and must borrow and lend to pay the other party and operate the project while 
maintaining a nonnegative cash balance after all debts are paid. Formulate the project 
valuation problem in a linear programming framework. 


8. Convert the following problem to a linear program in standard form:

$$
\begin{array}{ll}
\text{minimize}   & |x| + |y| + |z| \\
\text{subject to} & x + y \le 1 \\
                  & 2x + z = 3.
\end{array}
$$


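9. A class of piecewise linear functions can be represented as
$f(\mathbf{x}) = \text{Maximum}\,(\mathbf{c}_1^T\mathbf{x} + d_1, \mathbf{c}_2^T\mathbf{x} + d_2, \ldots, \mathbf{c}_p^T\mathbf{x} + d_p)$.
For such a function $f$, consider the problem

$$
\begin{array}{ll}
\text{minimize}   & f(\mathbf{x}) \\
\text{subject to} & \mathbf{A}\mathbf{x} = \mathbf{b} \\
                  & \mathbf{x} \ge \mathbf{0}.
\end{array}
$$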


Show how to convert this problem to a linear programming problem. 


10. A small computer manufacturing company forecasts the demand over the next $n$ months
to be $d_i$, $i = 1, 2, \ldots, n$. In any month it can produce $r$ units, using regular production,
at a cost of b dollars per unit. By using overtime, it can produce additional units at c 
dollars per unit, where c > b. The firm can store units from month to month at a cost 
of s dollars per unit per month. Formulate the problem of determining the production 
schedule that minimizes cost. (Hint: See Exercise 9.) 


11. Discuss the situation of a linear program that has one or more columns of the A matrix
equal to zero. Consider both the case where the corresponding variables are required to 
be nonnegative and the case where some are free. 


12. Suppose that the matrix $\mathbf{A} = (\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n)$ has rank $m$, and that for some
$p < m$, $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p$ are linearly independent. Show that $m - p$ vectors from the
remaining $n - p$ vectors can be adjoined to form a set of $m$ linearly independent vectors.


13. Suppose that $\mathbf{x}$ is a feasible solution to the linear program (11), with $\mathbf{A}$ an $m \times n$ matrix
of rank $m$. Show that there is a feasible solution $\mathbf{y}$ having the same value (that is,
$\mathbf{c}^T\mathbf{y} = \mathbf{c}^T\mathbf{x}$) and having at most $m + 1$ positive components.


14. What are the basic solutions of Example 3, Section 2.5?


15. Let $S$ be a convex set in $E^n$ and $S^*$ a convex set in $E^m$. Suppose $\mathbf{T}$ is an $m \times n$ matrix
that establishes a one-to-one correspondence between $S$ and $S^*$, i.e., for every $\mathbf{s} \in S$ there
is $\mathbf{s}^* \in S^*$ such that $\mathbf{T}\mathbf{s} = \mathbf{s}^*$, and for every $\mathbf{s}^* \in S^*$ there is a single $\mathbf{s} \in S$ such that $\mathbf{T}\mathbf{s} = \mathbf{s}^*$.
Show that there is a one-to-one correspondence between extreme points of $S$ and $S^*$.


16. Consider the two linear programming problems in Example 1, Section 2.1, one in $E^n$
and the other in $E^{n+m}$. Show that there is a one-to-one correspondence between extreme
points of these two problems.


Chapter 3 THE SIMPLEX
METHOD


The idea of the simplex method is to proceed from one basic feasible solution 
(that is, one extreme point) of the constraint set of a problem in standard form 
to another, in such a way as to continually decrease the value of the objective 
function until a minimum is reached. The results of Chapter 2 assure us that it 
is sufficient to consider only basic feasible solutions in our search for an optimal 
feasible solution. This chapter demonstrates that an efficient method for moving 
among basic solutions to the minimum can be constructed. 

In the first five sections of this chapter the simplex machinery is developed from 
a careful examination of the system of linear equations that defines the constraints 
and the basic feasible solutions of the system. This approach, which focuses on 
individual variables and their relation to the system, is probably the simplest, but 
unfortunately is not easily expressed in compact form. In the last few sections 
of the chapter, the simplex method is viewed from a matrix theoretic approach, 
which focuses on all variables together. This more sophisticated viewpoint leads to 
a compact notational representation, increased insight into the simplex process, and 
to alternative methods for implementation. 


3.1 PIVOTS 


To obtain a firm grasp of the simplex procedure, it is essential that one first 
understand the process of pivoting in a set of simultaneous linear equations. There 
are two dual interpretations of the pivot procedure. 
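
First Interpretation


Consider the set of simultaneous linear equations

$$
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\
&\;\;\vdots \\
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m,
\end{aligned}
\tag{1}
$$

where $m \le n$. In matrix form we write this as

$$
\mathbf{A}\mathbf{x} = \mathbf{b}. \tag{2}
$$

In the space $E^n$ we interpret this as a collection of $m$ linear relations that must be
satisfied by a vector $\mathbf{x}$. Thus, denoting by $\mathbf{a}^i$ the $i$th row of $\mathbf{A}$, we may express (1) as:

$$
\begin{aligned}
\mathbf{a}^1\mathbf{x} &= b_1 \\
\mathbf{a}^2\mathbf{x} &= b_2 \\
&\;\;\vdots \\
\mathbf{a}^m\mathbf{x} &= b_m.
\end{aligned}
\tag{3}
$$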


This corresponds to the most natural interpretation of (1) as a set of m equations. 

If $m < n$ and the equations are linearly independent, then there is not a unique
solution but a whole linear variety of solutions (see Appendix B). A unique solution
results, however, if $n - m$ additional independent linear equations are adjoined.
For example, we might specify $n - m$ equations of the form $\mathbf{e}^k\mathbf{x} = 0$, where $\mathbf{e}^k$
is the $k$th unit vector (which is equivalent to $x_k = 0$), in which case we obtain a
basic solution to (1). Different basic solutions are obtained by imposing different
additional equations of this special form.

If the equations (3) are linearly independent, we may replace a given equation 
by any nonzero multiple of itself plus any linear combination of the other equations 
in the system. This leads to the well-known Gaussian reduction schemes, whereby 
multiples of equations are systematically subtracted from one another to yield either 
a triangular or canonical form. It is well known, and easily proved, that if the first 
m columns of A are linearly independent, the system (1) can, by a sequence of such 
multiplications and subtractions, be converted to the following canonical form: 


$$
\begin{aligned}
x_1 + y_{1,m+1}x_{m+1} + y_{1,m+2}x_{m+2} + \cdots + y_{1n}x_n &= y_{10} \\
x_2 + y_{2,m+1}x_{m+1} + y_{2,m+2}x_{m+2} + \cdots + y_{2n}x_n &= y_{20} \\
&\;\;\vdots \\
x_m + y_{m,m+1}x_{m+1} + y_{m,m+2}x_{m+2} + \cdots + y_{mn}x_n &= y_{m0}.
\end{aligned}
\tag{4}
$$

Corresponding to this canonical representation of the system, the variables $x_1$,
$x_2, \ldots, x_m$ are called basic and the other variables are nonbasic. The corresponding
basic solution is then:

$$
x_1 = y_{10},\ x_2 = y_{20},\ \ldots,\ x_m = y_{m0},\ x_{m+1} = 0,\ \ldots,\ x_n = 0,
$$

or in vector form: $\mathbf{x} = (\mathbf{y}_0, \mathbf{0})$ where $\mathbf{y}_0$ is $m$-dimensional and $\mathbf{0}$ is the
$(n - m)$-dimensional zero vector.

Actually, we relax our definition somewhat and consider a system to be in 
canonical form if, among the n variables, there are m basic ones with the property 
that each appears in only one equation, its coefficient in that equation is unity, and 
no two of these m variables appear in any one equation. This is equivalent to saying 
that a system is in canonical form if by some reordering of the equations and the 
variables it takes the form (4). 

Also it is customary, from the dictates of economy, to represent the system (4) 
by its corresponding array of coefficients or tableau: 


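$$
\begin{array}{cccccccc|c}
1 & 0 & \cdots & 0 & y_{1,m+1} & y_{1,m+2} & \cdots & y_{1n} & y_{10} \\
0 & 1 & \cdots & 0 & y_{2,m+1} & y_{2,m+2} & \cdots & y_{2n} & y_{20} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & 1 & y_{m,m+1} & y_{m,m+2} & \cdots & y_{mn} & y_{m0}
\end{array}
$$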


The question solved by pivoting is this: given a system in canonical form,
suppose a basic variable is to be made nonbasic and a nonbasic variable is to be
made basic; what is the new canonical form corresponding to the new set of basic
variables? The procedure is quite simple. Suppose in the canonical system (4) we
wish to replace the basic variable $x_p$, $1 \le p \le m$, by the nonbasic variable $x_q$. This
can be done if and only if $y_{pq}$ is nonzero; it is accomplished by dividing row $p$ by
$y_{pq}$ to get a unit coefficient for $x_q$ in the $p$th equation, and then subtracting suitable
multiples of row $p$ from each of the other rows in order to get a zero coefficient
for $x_q$ in all other equations. This transforms the $q$th column of the tableau so that
it is zero except in its $p$th entry (which is unity) and does not affect the columns of
the other basic variables. Denoting the coefficients of the new system in canonical
form by $y'_{ij}$, we have explicitly

$$
\begin{aligned}
y'_{ij} &= y_{ij} - \frac{y_{pj}}{y_{pq}}\, y_{iq}, \quad i \ne p \\
y'_{pj} &= \frac{y_{pj}}{y_{pq}}.
\end{aligned}
\tag{5}
$$

Equations (5) are the pivot equations that arise frequently in linear programming.
The element $y_{pq}$ in the original system is said to be the pivot element.
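
The pivot equations (5) translate almost directly into code. The following is a minimal Python sketch, not part of the original text: the function name `pivot`, the list-of-rows layout of the tableau, and the 0-based indices are assumptions made for illustration. Exact rational arithmetic is used so that no rounding enters the example.

```python
from fractions import Fraction


def pivot(tableau, p, q):
    """Apply the pivot equations (5) on row p, column q (0-based indices).

    The tableau is a list of rows; the last entry of each row is y_i0.
    Returns a new tableau and leaves the input unchanged.
    """
    y_pq = Fraction(tableau[p][q])
    if y_pq == 0:
        raise ValueError("pivot element y_pq must be nonzero")
    # New pivot row: y'_pj = y_pj / y_pq.
    new_p = [Fraction(y_pj) / y_pq for y_pj in tableau[p]]
    result = []
    for i, row in enumerate(tableau):
        if i == p:
            result.append(new_p)
        else:
            # Other rows: y'_ij = y_ij - (y_pj / y_pq) * y_iq.
            y_iq = Fraction(row[q])
            result.append([Fraction(y_ij) - y_iq * w for y_ij, w in zip(row, new_p)])
    return result


# Hypothetical system in canonical form: x1 + 2*x3 = 4, x2 + x3 = 3.
demo = [[1, 0, 2, 4],
        [0, 1, 1, 3]]
after = pivot(demo, 0, 2)   # make x3 basic in place of x1
```

In this small illustration the new rows are (1/2, 0, 1, 2) and (-1/2, 1, 0, 1), so the new basic solution has $x_3 = 2$ and $x_2 = 1$.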
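
Example 1. Consider the system in canonical form:

$$
\begin{aligned}
x_1 + x_4 + x_5 - x_6 &= 5 \\
x_2 + 2x_4 - 3x_5 + x_6 &= 3 \\
x_3 - x_4 + 2x_5 - x_6 &= -1.
\end{aligned}
$$

Let us find the basic solution having basic variables $x_4, x_5, x_6$. We set up the
coefficient array below:

$$
\begin{array}{rrrrrr|r}
x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & \\ \hline
1 & 0 & 0 & \boxed{1} & 1 & -1 & 5 \\
0 & 1 & 0 & 2 & -3 & 1 & 3 \\
0 & 0 & 1 & -1 & 2 & -1 & -1
\end{array}
$$

The boxed element is our first pivot element and corresponds to the replacement
of $x_1$ by $x_4$ as a basic variable. After pivoting we obtain the array

$$
\begin{array}{rrrrrr|r}
x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & \\ \hline
1 & 0 & 0 & 1 & 1 & -1 & 5 \\
-2 & 1 & 0 & 0 & \boxed{-5} & 3 & -7 \\
1 & 0 & 1 & 0 & 3 & -2 & 4
\end{array}
$$

and again we have boxed the next pivot element, indicating our intention to replace
$x_2$ by $x_5$. We then obtain

$$
\begin{array}{rrrrrr|r}
x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & \\ \hline
3/5 & 1/5 & 0 & 1 & 0 & -2/5 & 18/5 \\
2/5 & -1/5 & 0 & 0 & 1 & -3/5 & 7/5 \\
-1/5 & 3/5 & 1 & 0 & 0 & \boxed{-1/5} & -1/5
\end{array}
$$

Continuing, with $x_3$ replaced by $x_6$, there results

$$
\begin{array}{rrrrrr|r}
x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & \\ \hline
1 & -1 & -2 & 1 & 0 & 0 & 4 \\
1 & -2 & -3 & 0 & 1 & 0 & 2 \\
1 & -3 & -5 & 0 & 0 & 1 & 1
\end{array}
$$

From this last canonical form we obtain the new basic solution

$$
x_4 = 4, \quad x_5 = 2, \quad x_6 = 1.
$$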
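
The three pivots of Example 1 can be replayed mechanically. The sketch below is an illustration rather than part of the original text; the helper `pivot` and the 0-based (row, column) pairs are assumed names and conventions, and the data are the coefficients of the system in Example 1.

```python
from fractions import Fraction


def pivot(rows, p, q):
    """Pivot on entry (p, q) of a tableau given as a list of rows (0-based)."""
    pr = [Fraction(v) / Fraction(rows[p][q]) for v in rows[p]]
    return [pr if i == p else [Fraction(v) - Fraction(r[q]) * w for v, w in zip(r, pr)]
            for i, r in enumerate(rows)]


# Columns: x1 x2 x3 x4 x5 x6 | right-hand side (Example 1).
tableau = [[1, 0, 0,  1,  1, -1,  5],
           [0, 1, 0,  2, -3,  1,  3],
           [0, 0, 1, -1,  2, -1, -1]]

# Replace x1 by x4, then x2 by x5, then x3 by x6.
for p, q in [(0, 3), (1, 4), (2, 5)]:
    tableau = pivot(tableau, p, q)

x4, x5, x6 = (row[-1] for row in tableau)
print(x4, x5, x6)   # prints: 4 2 1
```

The last column of the final tableau reproduces the basic solution stated in the example.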


Second Interpretation 


The set of simultaneous equations represented by (1) and (2) can be interpreted
in $E^m$ as a vector equation. Denoting the columns of $\mathbf{A}$ by $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n$ we write
(1) as

$$
x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_n\mathbf{a}_n = \mathbf{b}. \tag{6}
$$

In this interpretation we seek to express $\mathbf{b}$ as a linear combination of the $\mathbf{a}_i$'s.

If $m < n$ and the vectors $\mathbf{a}_i$ span $E^m$ then there is not a unique solution but a whole
family of solutions. The vector $\mathbf{b}$ has a unique representation, however, as a linear
combination of a given linearly independent subset of these vectors. The corresponding
solution with $n - m$ of the variables $x_i$ set equal to zero is a basic solution to (1).

Suppose now that we start with a system in the canonical form corresponding 
to the tableau 


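$$
\begin{array}{cccccccc|c}
1 & 0 & \cdots & 0 & y_{1,m+1} & y_{1,m+2} & \cdots & y_{1n} & y_{10} \\
0 & 1 & \cdots & 0 & y_{2,m+1} & y_{2,m+2} & \cdots & y_{2n} & y_{20} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & 1 & y_{m,m+1} & y_{m,m+2} & \cdots & y_{mn} & y_{m0}
\end{array}
\tag{7}
$$

In this case the first $m$ vectors form a basis. Furthermore, every other vector
represented in the tableau can be expressed as a linear combination of these basis
vectors by simply reading the coefficients down the corresponding column. Thus

$$
\mathbf{a}_j = y_{1j}\mathbf{a}_1 + y_{2j}\mathbf{a}_2 + \cdots + y_{mj}\mathbf{a}_m. \tag{8}
$$

The tableau can be interpreted as giving the representations of the vectors $\mathbf{a}_j$ in
terms of the basis; the $j$th column of the tableau is the representation for the vector $\mathbf{a}_j$.
In particular, the expression for $\mathbf{b}$ in terms of the basis is given in the last column.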
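
Relation (8) can be checked numerically on the data of Example 1. The sketch below is an illustration only: it takes the first and last tableaux of that example, where the final basis consists of the fourth, fifth, and sixth columns (canonical form in the relaxed sense), and confirms that each column of the final tableau gives the weights that rebuild the corresponding original column from the basis vectors. The names `original`, `final`, `column`, and `rebuild` are assumptions of this sketch.

```python
# First and last tableaux of Example 1 (columns a1..a6, then b).
original = [[1, 0, 0,  1,  1, -1,  5],
            [0, 1, 0,  2, -3,  1,  3],
            [0, 0, 1, -1,  2, -1, -1]]
final = [[1, -1, -2, 1, 0, 0, 4],
         [1, -2, -3, 0, 1, 0, 2],
         [1, -3, -5, 0, 0, 1, 1]]
basis = [3, 4, 5]        # 0-based positions of the basis vectors a4, a5, a6


def column(tableau, j):
    """Return column j of a tableau stored as a list of rows."""
    return [row[j] for row in tableau]


def rebuild(j):
    """Form y_1j*a4 + y_2j*a5 + y_3j*a6 using column j of the final tableau."""
    weights = column(final, j)
    vectors = [column(original, k) for k in basis]
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(3)]


checks = [rebuild(j) == column(original, j) for j in range(7)]
print(all(checks))   # prints: True
```

The last column is the case of $\mathbf{b}$: the weights (4, 2, 1) applied to the three basis vectors give back (5, 3, -1).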

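We now consider the operation of replacing one member of the basis by another
vector not already in the basis. Suppose for example we wish to replace the basis
vector $\mathbf{a}_p$, $1 \le p \le m$, by the vector $\mathbf{a}_q$. Provided that the first $m$ vectors with $\mathbf{a}_p$
replaced by $\mathbf{a}_q$ are linearly independent these vectors constitute a basis and every
vector can be expressed as a linear combination of this new basis. To find the new
representations of the vectors we must update the tableau. The linear independence
condition holds if and only if $y_{pq} \ne 0$.

Any vector $\mathbf{a}_j$ can be expressed in terms of the old array through (8). For $\mathbf{a}_q$
we have

$$
\mathbf{a}_q = \sum_{\substack{i=1 \\ i \ne p}}^{m} y_{iq}\mathbf{a}_i + y_{pq}\mathbf{a}_p,
$$

from which we may solve for $\mathbf{a}_p$,

$$
\mathbf{a}_p = \frac{1}{y_{pq}}\mathbf{a}_q - \sum_{\substack{i=1 \\ i \ne p}}^{m} \frac{y_{iq}}{y_{pq}}\mathbf{a}_i. \tag{9}
$$

Substituting (9) into (8) we obtain

$$
\mathbf{a}_j = \sum_{\substack{i=1 \\ i \ne p}}^{m} \left( y_{ij} - \frac{y_{iq}}{y_{pq}}\, y_{pj} \right) \mathbf{a}_i + \frac{y_{pj}}{y_{pq}}\mathbf{a}_q. \tag{10}
$$

Denoting the coefficients of the new tableau, which gives the linear combinations,
by $y'_{ij}$, we obtain immediately from (10)

$$
\begin{aligned}
y'_{ij} &= y_{ij} - \frac{y_{iq}}{y_{pq}}\, y_{pj}, \quad i \ne p \\
y'_{pj} &= \frac{y_{pj}}{y_{pq}}.
\end{aligned}
\tag{11}
$$

These formulae are identical to (5).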

If a system of equations is not originally given in canonical form, we may put 
it into canonical form by adjoining the m unit vectors to the tableau and, starting 
with these vectors as the basis, successively replace each of them with columns of 
A using the pivot operation. 


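Example 2. Suppose we wish to solve the simultaneous equations

$$
\begin{aligned}
x_1 + x_2 - x_3 &= 5 \\
2x_1 - 3x_2 + x_3 &= 3 \\
-x_1 + 2x_2 - x_3 &= -1.
\end{aligned}
$$

To obtain an original basis, we form the augmented tableau

$$
\begin{array}{rrrrrr|r}
\mathbf{e}_1 & \mathbf{e}_2 & \mathbf{e}_3 & \mathbf{a}_1 & \mathbf{a}_2 & \mathbf{a}_3 & \mathbf{b} \\ \hline
1 & 0 & 0 & 1 & 1 & -1 & 5 \\
0 & 1 & 0 & 2 & -3 & 1 & 3 \\
0 & 0 & 1 & -1 & 2 & -1 & -1
\end{array}
$$

and replace $\mathbf{e}_1$ by $\mathbf{a}_1$, $\mathbf{e}_2$ by $\mathbf{a}_2$, and $\mathbf{e}_3$ by $\mathbf{a}_3$. The required operations are identical
to those of Example 1.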
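
The procedure of adjoining unit vectors and pivoting them out can be sketched for a general square system. This is an illustrative sketch under stated assumptions, not the text's own algorithm statement: the function name `solve_by_pivoting` is hypothetical, the system is assumed square, and each entering column is allowed to replace any unit vector still in the basis whose row holds a nonzero entry.

```python
from fractions import Fraction


def solve_by_pivoting(A, b):
    """Solve a square system Ax = b by adjoining unit vectors and pivoting.

    Returns the solution as a list of Fractions, or raises ValueError if the
    columns of A are linearly dependent.
    """
    m = len(A)
    # Augmented tableau [I | A | b]; the unit vectors form the starting basis.
    T = [[Fraction(int(i == k)) for k in range(m)]
         + [Fraction(v) for v in A[i]] + [Fraction(b[i])] for i in range(m)]
    row_of = {}                  # column of A -> row in which it became basic
    free_rows = set(range(m))    # rows whose unit vector is still in the basis
    for j in range(m):
        q = m + j
        # Any unit vector still in the basis with a nonzero entry can leave.
        p = next((i for i in sorted(free_rows) if T[i][q] != 0), None)
        if p is None:
            raise ValueError("columns of A are linearly dependent")
        piv = T[p][q]
        T[p] = [v / piv for v in T[p]]
        for i in range(m):
            if i != p:
                f = T[i][q]
                T[i] = [v - f * w for v, w in zip(T[i], T[p])]
        free_rows.remove(p)
        row_of[j] = p
    return [T[row_of[j]][-1] for j in range(m)]


# The system of Example 2.
solution = solve_by_pivoting([[1, 1, -1], [2, -3, 1], [-1, 2, -1]], [5, 3, -1])
print(solution == [4, 2, 1])   # prints: True
```

For the system of Example 2 the pivots fall on the same elements as in Example 1, which is why the solution agrees with the last column obtained there.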


3.2 ADJACENT EXTREME POINTS 


In Chapter 2 it was discovered that it is only necessary to consider basic feasible 
solutions to the system 


$$
\begin{aligned}
\mathbf{A}\mathbf{x} &= \mathbf{b} \\
\mathbf{x} &\ge \mathbf{0}
\end{aligned}
\tag{12}
$$

when solving a linear program, and in the previous section it was demonstrated that 
the pivot operation can generate a new basic solution from an old one by replacing 
one basic variable by a nonbasic variable. It is clear, however, that although the pivot 
operation takes one basic solution into another, the nonnegativity of the solution 
will not in general be preserved. Special conditions must be satisfied in order that 
a pivot operation maintain feasibility. In this section we show how it is possible to 
select pivots so that we may transfer from one basic feasible solution to another. 
We show that although it is not possible to arbitrarily specify the pair of
variables whose roles are to be interchanged and expect to maintain the
nonnegativity condition, it is possible to arbitrarily specify which nonbasic variable is to
become basic and then determine which basic variable should become nonbasic.
As is conventional, we base our derivation on the vector interpretation of the linear
equations although the dual interpretation could alternatively be used.


Nondegeneracy Assumption 


Many arguments in linear programming are substantially simplified upon the
introduction of the following.


Nondegeneracy assumption: Every basic feasible solution of (12) is a
nondegenerate basic feasible solution.


This assumption is invoked throughout our development of the simplex method, 
since when it does not hold the simplex method can break down if it is not suitably 
amended. The assumption, however, should be regarded as one made primarily for 
convenience, since all arguments can be extended to include degeneracy, and the 
simplex method itself can be easily modified to account for it. 


Determination of Vector to Leave Basis 


Suppose we have the basic feasible solution $\mathbf{x} = (x_1, x_2, \ldots, x_m, 0, 0, \ldots, 0)$ or,
equivalently, the representation

$$
x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_m\mathbf{a}_m = \mathbf{b}. \tag{13}
$$

Under the nondegeneracy assumption, $x_i > 0$, $i = 1, 2, \ldots, m$. Suppose also that
we have decided to bring into the representation the vector $\mathbf{a}_q$, $q > m$. We have
available a representation of $\mathbf{a}_q$ in terms of the current basis

$$
\mathbf{a}_q = y_{1q}\mathbf{a}_1 + y_{2q}\mathbf{a}_2 + \cdots + y_{mq}\mathbf{a}_m. \tag{14}
$$

Multiplying (14) by a variable $\varepsilon \ge 0$ and subtracting from (13), we have

$$
(x_1 - \varepsilon y_{1q})\mathbf{a}_1 + (x_2 - \varepsilon y_{2q})\mathbf{a}_2 + \cdots + (x_m - \varepsilon y_{mq})\mathbf{a}_m + \varepsilon\mathbf{a}_q = \mathbf{b}. \tag{15}
$$

Thus, for any $\varepsilon \ge 0$ (15) gives $\mathbf{b}$ as a linear combination of at most $m + 1$ vectors.
For $\varepsilon = 0$ we have the old basic feasible solution. As $\varepsilon$ is increased from zero,
the coefficient of $\mathbf{a}_q$ increases, and it is clear that for small enough $\varepsilon$, (15) gives
a feasible but nonbasic solution. The coefficients of the other vectors will either
increase or decrease linearly as $\varepsilon$ is increased. If any decrease, we may set $\varepsilon$ equal
to the value corresponding to the first place where one (or more) of the coefficients
vanishes. That is

$$
\varepsilon = \min_i \{ x_i / y_{iq} : y_{iq} > 0 \}. \tag{16}
$$
In this case we have a new basic feasible solution, with the vector $\mathbf{a}_q$ replacing the
vector $\mathbf{a}_p$, where $p$ corresponds to the minimizing index in (16). If the minimum in
(16) is achieved by more than a single index $i$, then the new solution is degenerate
and any of the vectors with zero component can be regarded as the one which left
the basis.

If none of the $y_{iq}$'s are positive, then all coefficients in the representation (15)
increase (or remain constant) as $\varepsilon$ is increased, and no new basic feasible solution
is obtained. We observe, however, that in this case, where none of the $y_{iq}$'s are
positive, there are feasible solutions to (12) having arbitrarily large coefficients.
This means that the set $K$ of feasible solutions to (12) is unbounded, and this special
case, as we shall see, is of special significance in the simplex procedure.

In summary, we have deduced that given a basic feasible solution and an
arbitrary vector $\mathbf{a}_q$, there is either a new basic feasible solution having $\mathbf{a}_q$ in its
basis and one of the original vectors removed, or the set of feasible solutions is
unbounded.
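
The rule (16), together with the unbounded case, can be sketched as a small function. This is a minimal illustration: the name `ratio_test`, the 0-based row index it returns, and the tie-breaking by smallest index are assumptions of the sketch, since the text allows any minimizing index to be chosen.

```python
from fractions import Fraction


def ratio_test(y_q, y_0):
    """Return (p, epsilon) from (16), or None if no entry of y_q is positive.

    y_q: column of the entering vector a_q in the current tableau.
    y_0: last column of the tableau, i.e. the current basic solution.
    """
    candidates = [(Fraction(x_i) / Fraction(y_iq), i)
                  for i, (y_iq, x_i) in enumerate(zip(y_q, y_0)) if y_iq > 0]
    if not candidates:
        return None                  # no basic variable decreases: unbounded case
    epsilon, p = min(candidates)     # ties broken by the smaller row index
    return p, epsilon


# Hypothetical data: entering column (2, 1, -1), current solution (4, 3, 1).
print(ratio_test([2, 1, -1], [4, 3, 1]))   # prints: (0, Fraction(2, 1))
print(ratio_test([-1, 0], [1, 1]))         # prints: None
```

In the first call the first basic variable reaches zero at $\varepsilon = 2$ and therefore leaves the basis; in the second no coefficient decreases, which is the unbounded case described above.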

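Let us consider how the calculation of this section can be displayed in our
tableau. We assume that corresponding to the constraints

$$
\begin{aligned}
\mathbf{A}\mathbf{x} &= \mathbf{b} \\
\mathbf{x} &\ge \mathbf{0},
\end{aligned}
$$

we have a tableau of the form

$$
\begin{array}{ccccccccc|c}
\mathbf{a}_1 & \mathbf{a}_2 & \mathbf{a}_3 & \cdots & \mathbf{a}_m & \mathbf{a}_{m+1} & \mathbf{a}_{m+2} & \cdots & \mathbf{a}_n & \mathbf{b} \\ \hline
1 & 0 & 0 & \cdots & 0 & y_{1,m+1} & y_{1,m+2} & \cdots & y_{1n} & y_{10} \\
0 & 1 & 0 & \cdots & 0 & y_{2,m+1} & y_{2,m+2} & \cdots & y_{2n} & y_{20} \\
0 & 0 & 1 & \cdots & 0 & \vdots & \vdots & & \vdots & \vdots \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & y_{m,m+1} & y_{m,m+2} & \cdots & y_{mn} & y_{m0}
\end{array}
\tag{17}
$$

This tableau may be the result of several pivot operations applied to the original
tableau, but in any event, it represents a solution with basis $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_m$. We
assume that $y_{10}, y_{20}, \ldots, y_{m0}$ are nonnegative, so that the corresponding basic
solution $x_1 = y_{10}, x_2 = y_{20}, \ldots, x_m = y_{m0}$ is feasible. We wish to bring into the
basis the vector $\mathbf{a}_q$, $q > m$, and maintain feasibility. In order to determine which
element in the $q$th column to use as pivot (and hence which vector in the basis will
leave), we use (16) and compute the ratios $x_i/y_{iq} = y_{i0}/y_{iq}$, $i = 1, 2, \ldots, m$, select
the smallest nonnegative ratio, and pivot on the corresponding $y_{iq}$.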


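Example 3. Consider the system

$$
\begin{array}{rrrrrr|r}
\mathbf{a}_1 & \mathbf{a}_2 & \mathbf{a}_3 & \mathbf{a}_4 & \mathbf{a}_5 & \mathbf{a}_6 & \mathbf{b} \\ \hline
1 & 0 & 0 & 2 & 4 & 6 & 4 \\
0 & 1 & 0 & 1 & 2 & 3 & 3 \\
0 & 0 & 1 & -1 & 2 & 1 & 1
\end{array}
$$

which has basis $\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3$ yielding a basic feasible solution $\mathbf{x} = (4, 3, 1, 0, 0, 0)$.
Suppose we elect to bring $\mathbf{a}_4$ into the basis. To determine which element in the
fourth column is the appropriate pivot, we compute the three ratios:

$$
4/2 = 2, \quad 3/1 = 3, \quad 1/(-1) = -1
$$

and select the smallest nonnegative one. This gives 2 as the pivot element. The new
tableau is

$$
\begin{array}{rrrrrr|r}
\mathbf{a}_1 & \mathbf{a}_2 & \mathbf{a}_3 & \mathbf{a}_4 & \mathbf{a}_5 & \mathbf{a}_6 & \mathbf{b} \\ \hline
1/2 & 0 & 0 & 1 & 2 & 3 & 2 \\
-1/2 & 1 & 0 & 0 & 0 & 0 & 1 \\
1/2 & 0 & 1 & 0 & 4 & 4 & 3
\end{array}
$$

with corresponding basic feasible solution $\mathbf{x} = (0, 1, 3, 2, 0, 0)$.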
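
The family of solutions (15) can be traced for Example 3. In the sketch below the starting tableau is an assumption: it is the array obtained by undoing the pivot from the resulting tableau shown in the example, which is consistent with the stated ratios 4/2, 3/1, and 1/(-1). The helper names `x_of` and `satisfies_constraints` are hypothetical.

```python
from fractions import Fraction

# Starting tableau of Example 3 (columns a1..a6) and right-hand side b;
# inferred from the pivoted tableau and the ratios given in the example.
A = [[1, 0, 0,  2, 4, 6],
     [0, 1, 0,  1, 2, 3],
     [0, 0, 1, -1, 2, 1]]
b = [4, 3, 1]
y_q = [row[3] for row in A]      # representation of a4 in the basis a1, a2, a3


def x_of(eps):
    """Solution (15): coefficients x_i - eps*y_iq on the basis, eps on a4."""
    return [b[i] - eps * y_q[i] for i in range(3)] + [eps, 0, 0]


def satisfies_constraints(x):
    """Check Ax = b exactly."""
    return all(sum(a * v for a, v in zip(row, x)) == rhs for row, rhs in zip(A, b))


for eps in [Fraction(0), Fraction(1), Fraction(2), Fraction(5, 2)]:
    x = x_of(eps)
    print(eps, [str(v) for v in x], satisfies_constraints(x), all(v >= 0 for v in x))
```

Every point on the path satisfies the equality constraints; nonnegativity holds up to $\varepsilon = 2$, where the first component vanishes and the basic solution (0, 1, 3, 2, 0, 0) is reached, and fails beyond it.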


Our derivation of the method for selecting the pivot in a given column that
will yield a new feasible solution has been based on the vector interpretation of
the equation $\mathbf{A}\mathbf{x} = \mathbf{b}$. An alternative derivation can be constructed by considering
the dual approach that is based on the rows of the tableau rather than the columns.
Briefly, the argument runs like this: if we decide to pivot on $y_{pq}$, then we first divide
the $p$th row by the pivot element $y_{pq}$ to change it to unity. In order that the new $y_{p0}$
remain positive, it is clear that we must have $y_{pq} > 0$. Next we subtract multiples
of the $p$th row from each other row in order to obtain zeros in the $q$th column.
In this process the new elements in the last column must remain nonnegative, if
the pivot was properly selected. The full operation is to subtract, from the $i$th row,
$y_{iq}/y_{pq}$ times the $p$th row. This yields a new solution obtained directly from the last
column:

$$
x_i' = x_i - \frac{y_{iq}}{y_{pq}}\, x_p.
$$

For this to remain nonnegative, it follows that $x_p/y_{pq} \le x_i/y_{iq}$ whenever $y_{iq} > 0$, and hence
again we are led to the conclusion that we select $p$ as the index $i$ minimizing $x_i/y_{iq}$
over the rows with $y_{iq} > 0$.


Geometrical Interpretations 


Corresponding to the two interpretations of pivoting and extreme points, developed 
algebraically, are two geometrical interpretations. The first is in activity space, the 
space where x is represented. This is perhaps the most natural space to consider, and 
it was used in Section 2.5. Here the feasible region is shown directly as a convex 
set, and basic feasible solutions are extreme points. Adjacent extreme points are 
points that lie on a common edge. 

The second geometrical interpretation is in requirements space, the space where 
the columns of A and b are represented. The fundamental relation is 


$$
\mathbf{a}_1x_1 + \mathbf{a}_2x_2 + \cdots + \mathbf{a}_nx_n = \mathbf{b}.
$$


Fig. 3.1 Constraint representation in requirements space


An example for $m = 2$, $n = 4$ is shown in Fig. 3.1. A feasible solution defines a
representation of $\mathbf{b}$ as a positive combination of the $\mathbf{a}_i$'s. A basic feasible solution
will use only $m$ positive weights. In the figure a basic feasible solution can be
constructed with positive weights on $\mathbf{a}_1$ and $\mathbf{a}_2$ because $\mathbf{b}$ lies between them. A
basic feasible solution cannot be constructed with positive weights on $\mathbf{a}_1$ and $\mathbf{a}_4$.
Suppose we start with $\mathbf{a}_1$ and $\mathbf{a}_2$ as the initial basis. Then an adjacent basis is found
by bringing in some other vector. If $\mathbf{a}_3$ is brought in, then clearly $\mathbf{a}_2$ must go out.
On the other hand, if $\mathbf{a}_4$ is brought in, $\mathbf{a}_1$ must go out.


3.3 DETERMINING A MINIMUM FEASIBLE
SOLUTION 


In the last section we showed how it is possible to pivot from one basic feasible 
solution to another (or determine that the solution set is unbounded) by arbitrarily 
selecting a column to pivot on and then appropriately selecting the pivot in that 
column. The idea of the simplex method is to select the column so that the resulting 
new basic feasible solution will yield a lower value to the objective function than 
the previous one. This then provides the final link in the simplex procedure. By an 
elementary calculation, which is derived below, it is possible to determine which 
vector should enter the basis so that the objective value is reduced, and by another 
simple calculation, derived in the previous section, it is possible to then determine 
which vector should leave in order to maintain feasibility. 
Suppose we have a basic feasible solution

$$
(\mathbf{x}_B, \mathbf{0}) = (y_{10}, y_{20}, \ldots, y_{m0}, 0, 0, \ldots, 0)
$$

together with a tableau having an identity matrix appearing in the first $m$ columns
as shown below:

$$
\begin{array}{cccccccc|c}
\mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_m & \mathbf{a}_{m+1} & \mathbf{a}_{m+2} & \cdots & \mathbf{a}_n & \mathbf{b} \\ \hline
1 & 0 & \cdots & 0 & y_{1,m+1} & y_{1,m+2} & \cdots & y_{1n} & y_{10} \\
0 & 1 & \cdots & 0 & y_{2,m+1} & y_{2,m+2} & \cdots & y_{2n} & y_{20} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & 1 & y_{m,m+1} & y_{m,m+2} & \cdots & y_{mn} & y_{m0}
\end{array}
\tag{18}
$$

The value of the objective function corresponding to any solution $\mathbf{x}$ is

$$
z = c_1x_1 + c_2x_2 + \cdots + c_nx_n, \tag{19}
$$

and hence for the basic solution, the corresponding value is

$$
z_0 = \mathbf{c}_B^T\mathbf{x}_B, \tag{20}
$$

where $\mathbf{c}_B^T = [c_1, c_2, \ldots, c_m]$.

Although it is natural to use the basic solution $(\mathbf{x}_B, \mathbf{0})$ when we have the tableau
(18), it is clear that if arbitrary values are assigned to $x_{m+1}, x_{m+2}, \ldots, x_n$, we can
easily solve for the remaining variables as

$$
\begin{aligned}
x_1 &= y_{10} - \sum_{j=m+1}^{n} y_{1j}x_j \\
x_2 &= y_{20} - \sum_{j=m+1}^{n} y_{2j}x_j \\
&\;\;\vdots \\
x_m &= y_{m0} - \sum_{j=m+1}^{n} y_{mj}x_j.
\end{aligned}
\tag{21}
$$

Using (21) we may eliminate $x_1, x_2, \ldots, x_m$ from the general formula (19). Doing
this we obtain

$$
z = \mathbf{c}^T\mathbf{x} = z_0 + (c_{m+1} - z_{m+1})x_{m+1} + (c_{m+2} - z_{m+2})x_{m+2} + \cdots + (c_n - z_n)x_n \tag{22}
$$

where

$$
z_j = y_{1j}c_1 + y_{2j}c_2 + \cdots + y_{mj}c_m, \quad m + 1 \le j \le n, \tag{23}
$$

which is the fundamental relation required to determine the pivot column. The
important point is that this equation gives the values of the objective function $z$
for any solution of $\mathbf{A}\mathbf{x} = \mathbf{b}$ in terms of the variables $x_{m+1}, \ldots, x_n$. From it we can
determine if there is any advantage in changing the basic solution by introducing
one of the nonbasic variables. For example, if $c_j - z_j$ is negative for some $j$,
$m + 1 \le j \le n$, then increasing $x_j$ from zero to some positive value would decrease the
total cost, and therefore would yield a better solution. The formulae (22) and (23)
automatically take into account the changes that would be required in the values of
the basic variables $x_1, x_2, \ldots, x_m$ to accommodate the change in $x_j$.
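
Formulas (22) and (23) can be evaluated directly from a canonical tableau. The sketch below is illustrative only: it uses the starting tableau of Example 3 in Section 3.2 together with a hypothetical cost vector `c` (the example itself specifies no costs), and the function name `relative_costs` is an assumption.

```python
from fractions import Fraction


def relative_costs(Y, c, basis):
    """Return (z_0, r) where r[j] = c_j - z_j, following (22) and (23).

    Y:     constraint rows of a canonical tableau, last entry y_i0.
    c:     cost coefficients c_1..c_n.
    basis: basis[i] is the (0-based) column of the basic variable of row i.
    """
    c_B = [Fraction(c[k]) for k in basis]
    n = len(c)
    z = [sum(c_i * row[j] for c_i, row in zip(c_B, Y)) for j in range(n)]
    z_0 = sum(c_i * row[n] for c_i, row in zip(c_B, Y))
    return z_0, [Fraction(c[j]) - z[j] for j in range(n)]


Y = [[1, 0, 0,  2, 4, 6, 4],
     [0, 1, 0,  1, 2, 3, 3],
     [0, 0, 1, -1, 2, 1, 1]]
c = [2, 1, 3, 1, 5, 4]               # hypothetical costs
z_0, r = relative_costs(Y, c, [0, 1, 2])

# Check (22) for one nonbasic choice: x4 = 1, x5 = x6 = 0, basics from (21).
x = [Y[i][6] - Y[i][3] * 1 for i in range(3)] + [1, 0, 0]
cost = sum(ci * xi for ci, xi in zip(c, x))
print(z_0, [str(v) for v in r], cost == z_0 + r[3] * 1)
```

With these assumed costs $z_0 = 14$ and the differences $c_j - z_j$ for the three nonbasic columns are -1, -11, and -14, so raising any nonbasic variable from zero lowers the cost; the final comparison confirms (22) for $x_4 = 1$.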

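Let us derive these relations from a different viewpoint. Let $\mathbf{y}_i$ be the $i$th column
of the tableau. Then any solution satisfies

$$
x_1\mathbf{e}_1 + x_2\mathbf{e}_2 + \cdots + x_m\mathbf{e}_m = \mathbf{y}_0 - x_{m+1}\mathbf{y}_{m+1} - x_{m+2}\mathbf{y}_{m+2} - \cdots - x_n\mathbf{y}_n.
$$

Taking the inner product of this vector equation with $\mathbf{c}_B^T$, we have

$$
\sum_{i=1}^{m} c_ix_i = \mathbf{c}_B^T\mathbf{y}_0 - \sum_{j=m+1}^{n} z_jx_j,
$$

where $z_j = \mathbf{c}_B^T\mathbf{y}_j$. Thus, adding $\sum_{j=m+1}^{n} c_jx_j$ to both sides,

$$
\mathbf{c}^T\mathbf{x} = z_0 + \sum_{j=m+1}^{n} (c_j - z_j)x_j \tag{24}
$$

as before.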
We now state the condition for improvement, which follows easily from the 
above observation, as a theorem. 


Theorem. (Improvement of basic feasible solution). Given a nondegenerate
basic feasible solution with corresponding objective value $z_0$, suppose that for
some $j$ there holds $c_j - z_j < 0$. Then there is a feasible solution with objective
value $z < z_0$. If the column $\mathbf{a}_j$ can be substituted for some vector in the original
basis to yield a new basic feasible solution, this new solution will have $z < z_0$.
If $\mathbf{a}_j$ cannot be substituted to yield a basic feasible solution, then the solution
set $K$ is unbounded and the objective function can be made arbitrarily small
(toward minus infinity).


Proof. The result is an immediate consequence of the previous discussion. Let
$(x_1, x_2, \ldots, x_m, 0, 0, \ldots, 0)$ be the basic feasible solution with objective value $z_0$
and suppose $c_{m+1} - z_{m+1} < 0$. Then, in any case, new feasible solutions can be
constructed of the form $(x_1', x_2', \ldots, x_m', x_{m+1}', 0, 0, \ldots, 0)$ with $x_{m+1}' > 0$.
Substituting this solution in (22) we have

$$
z - z_0 = (c_{m+1} - z_{m+1})\, x_{m+1}' < 0,
$$

and hence $z < z_0$ for any such solution. It is clear that we desire to make $x_{m+1}'$
as large as possible. As $x_{m+1}'$ is increased, the other components increase, remain
constant, or decrease. Thus $x_{m+1}'$ can be increased until one $x_i' = 0$, $i \le m$, in which
case we obtain a new basic feasible solution, or if none of the $x_i'$'s decrease, $x_{m+1}'$ can
be increased without bound, indicating an unbounded solution set and an objective
value without lower bound. ∎


We see that if at any stage $c_j - z_j < 0$ for some $j$, it is possible to make
$x_j$ positive and decrease the objective function. The final question remaining is
whether $c_j - z_j \ge 0$ for all $j$ implies optimality.


Optimality Condition Theorem. If for some basic feasible solution $c_j - z_j \ge 0$
for all $j$, then that solution is optimal.


Proof. This follows immediately from (22), since any other feasible solution must
have $x_i \ge 0$ for all $i$, and hence the value $z$ of the objective will satisfy $z - z_0 \ge 0$. ∎


Since the constants $c_j - z_j$ play such a central role in the development of the
simplex method, it is convenient to introduce the somewhat abbreviated notation
$r_j = c_j - z_j$ and refer to the $r_j$'s as the relative cost coefficients or, alternatively, the
reduced cost coefficients (both terms occur in common usage). These coefficients
measure the cost of a variable relative to a given basis. (For notational convenience
we extend the definition of relative cost coefficients to basic variables as well; the
relative cost coefficient of a basic variable is zero.)

We conclude this section by giving an economic interpretation of the relative 
cost coefficients. Let us agree to interpret the linear program 


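$$
\begin{array}{ll}
\text{minimize}   & \mathbf{c}^T\mathbf{x} \\
\text{subject to} & \mathbf{A}\mathbf{x} = \mathbf{b} \\
                  & \mathbf{x} \ge \mathbf{0}
\end{array}
$$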


as a diet problem (see Section 2.2) where the nutritional requirements must be met 
exactly. A column of A gives the nutritional equivalent of a unit of a particular food. 
With a given basis consisting of, say, the first m columns of A, the corresponding 
simplex tableau shows how any food (or more precisely, the nutritional content of 
any food) can be constructed as a combination of foods in the basis. For instance, 
if carrots are not in the basis we can, using the description given by the tableau, 
construct a synthetic carrot which is nutritionally equivalent to a carrot, by an 
appropriate combination of the foods in the basis. 

In considering whether or not the solution represented by the current basis is 
optimal, we consider a certain food not in the basis—say carrots—and determine if 
it would be advantageous to bring it into the basis. This is very easily determined 
by examining the cost of carrots as compared with the cost of synthetic carrots. If 
carrots are food $j$, then the unit cost of carrots is $c_j$. The cost of a unit of synthetic
carrots is, on the other hand,

$$
z_j = \sum_{i=1}^{m} c_i y_{ij}.
$$

If $r_j = c_j - z_j < 0$, it is advantageous to use real carrots in place of synthetic carrots,
and carrots should be brought into the basis.

In general each $z_j$ can be thought of as the price of a unit of the column $\mathbf{a}_j$ when
constructed from the current basis. The difference between this synthetic price and the
direct price of that column determines whether that column should enter the basis.


3.4 COMPUTATIONAL PROCEDURE—SIMPLEX 
METHOD 


In previous sections the theory, and indeed much of the technique, necessary for 
the detailed development of the simplex method has been established. It is only 
necessary to put it all together and illustrate it with examples. 

In this section we assume that we begin with a basic feasible solution and 
that the tableau corresponding to Ax = b is in the canonical form for this solution. 
Methods for obtaining this first basic feasible solution, when one is not obvious, 
are described in the next section. 

In addition to beginning with the array Ax = b expressed in canonical form 
corresponding to a basic feasible solution, we append a row at the bottom consisting 
of the relative cost coefficients and the negative of the current cost. The result is a 
simplex tableau. 

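Thus, if we assume the basic variables are (in order) $x_1, x_2, \ldots, x_m$, the simplex
tableau takes the initial form shown in Fig. 3.2.

The basic solution corresponding to this tableau is

$$
x_i =
\begin{cases}
y_{i0}, & 1 \le i \le m \\
0, & m + 1 \le i \le n
\end{cases}
$$

which we have assumed is feasible, that is, $y_{i0} \ge 0$, $i = 1, 2, \ldots, m$. The
corresponding value of the objective function is $z_0$.

$$
\begin{array}{cccccccccc|c}
\mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_m & \mathbf{a}_{m+1} & \mathbf{a}_{m+2} & \cdots & \mathbf{a}_j & \cdots & \mathbf{a}_n & \mathbf{b} \\ \hline
1 & 0 & \cdots & 0 & y_{1,m+1} & y_{1,m+2} & \cdots & y_{1j} & \cdots & y_{1n} & y_{10} \\
0 & 1 & \cdots & 0 & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
\vdots & \vdots & & \vdots & y_{i,m+1} & y_{i,m+2} & \cdots & y_{ij} & \cdots & y_{in} & y_{i0} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & 1 & y_{m,m+1} & y_{m,m+2} & \cdots & y_{mj} & \cdots & y_{mn} & y_{m0} \\ \hline
0 & 0 & \cdots & 0 & r_{m+1} & r_{m+2} & \cdots & r_j & \cdots & r_n & -z_0
\end{array}
$$

Fig. 3.2 Canonical simplex tableau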

The relative cost coefficients $r_j$ indicate whether the value of the objective
will increase or decrease if $x_j$ is pivoted into the solution. If these coefficients
are all nonnegative, then the indicated solution is optimal. If some of them are
negative, an improvement can be made (assuming nondegeneracy) by bringing the
corresponding component into the solution. When more than one of the relative
cost coefficients is negative, any one of them may be selected to determine in
which column to pivot. Common practice is to select the most negative value. (See
Exercise 13 for further discussion of this point.)

Some more discussion of the relative cost coefficients and the last row of the 
tableau is warranted. We may regard z as an additional variable and 


CyX, + CgX) +++ +0,X, —Z=0 


as another equation. A basic solution to the augmented system will have m+ | basic 
variables, but we can require that z be one of them. For this reason it is not necessary 
to add a column corresponding to z, since it would always be (0, 0,...,0, 1). 
Thus, initially, a last row consisting of the c;’s and a right-hand side of zero can be 
appended to the standard array to represent this additional equation. Using standard 
pivot operations, the elements in this row corresponding to basic variables can be 
reduced to zero. This is equivalent to transforming the additional equation to the form 


$$r_{m+1} x_{m+1} + r_{m+2} x_{m+2} + \cdots + r_n x_n - z = -z_0. \qquad (25)$$


This must be equivalent to (24), and hence the $r_j$'s obtained are the relative cost 
coefficients. Thus, the last row can be treated operationally like any other row: just start 
with the $c_j$'s and reduce the terms corresponding to basic variables to zero by row operations. 

After a column q is selected in which to pivot, the final selection of the 
pivot element is made by computing the ratio $y_{i0}/y_{iq}$ for the positive elements $y_{iq}$, 
$i = 1, 2, \ldots, m$, of the qth column and selecting the element p yielding the minimum 
ratio. Pivoting on this element will maintain feasibility as well as (assuming 
nondegeneracy) decrease the value of the objective function. If there are ties, any element 
yielding the minimum can be used. If there are no positive elements in the 
column, the problem is unbounded. After updating the entire tableau with $y_{pq}$ as 
pivot and transforming the last row in the same manner as all other rows (except 
row p), we obtain a new tableau in canonical form. The new value of the objective 
function again appears in the lower right-hand corner of the tableau. 

The simplex algorithm can be summarized by the following steps: 


Step 0. Form a tableau as in Fig. 3.2 corresponding to a basic feasible solution. The 
relative cost coefficients can be found by row reduction. 

Step 1. If each $r_j \ge 0$, stop; the current basic feasible solution is optimal. 

Step 2. Select q such that $r_q < 0$ to determine which nonbasic variable is to 
become basic. 

Step 3. Calculate the ratios $y_{i0}/y_{iq}$ for $y_{iq} > 0$, $i = 1, 2, \ldots, m$. If no $y_{iq} > 0$, stop; 
the problem is unbounded. Otherwise, select p as the index i corresponding to 
the minimum ratio. 

Step 4. Pivot on the pqth element, updating all rows including the last. Return to 
Step 1. 
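
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production solver: the function names `pivot` and `simplex`, the list-of-lists tableau layout, and the use of exact fractions are all assumptions made for the sketch. The entering column is chosen by the most negative relative cost, and no anticycling safeguard is included.

```python
from fractions import Fraction

def pivot(T, p, q):
    """Pivot the tableau T in place on row p, column q."""
    piv = T[p][q]
    T[p] = [v / piv for v in T[p]]
    for i in range(len(T)):
        if i != p and T[i][q] != 0:
            f = T[i][q]
            T[i] = [a - f * t for a, t in zip(T[i], T[p])]

def simplex(T, basis):
    """Apply Steps 1-4 to a canonical tableau T whose last row holds the
    relative costs and whose last column holds b (and -z0 in the corner).
    basis[i] is the index of the variable that is basic in row i."""
    m, n = len(T) - 1, len(T[0]) - 1
    while True:
        r = T[-1][:n]
        # Step 1: stop when every relative cost is nonnegative.
        if all(rj >= 0 for rj in r):
            return "optimal"
        # Step 2: entering column q (most negative relative cost).
        q = min(range(n), key=lambda j: r[j])
        # Step 3: ratio test over rows with a positive entry in column q.
        rows = [i for i in range(m) if T[i][q] > 0]
        if not rows:
            return "unbounded"
        p = min(rows, key=lambda i: T[i][n] / T[i][q])
        # Step 4: pivot on element (p, q), updating all rows including the last.
        pivot(T, p, q)
        basis[p] = q

# Data of Example 1 below, after adding slacks and negating the objective.
T = [[Fraction(v) for v in row] for row in [
    [2, 1, 1, 1, 0, 0, 2],
    [1, 2, 3, 0, 1, 0, 5],
    [2, 2, 1, 0, 0, 1, 6],
    [-3, -1, -3, 0, 0, 0, 0],
]]
basis = [3, 4, 5]          # slack variables x4, x5, x6 (0-based indices)
status = simplex(T, basis)
```

Because this sketch always takes the most negative relative cost, it visits a different sequence of bases than the hand calculation in Example 1, but it ends at the same optimal tableau.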

Proof that the algorithm solves the problem (again assuming nondegeneracy) is 
essentially established by our previous development. The process terminates only 
if optimality is achieved or unboundedness is discovered. If neither condition is 
discovered at a given basic solution, then the objective is strictly decreased. Since 
there are only a finite number of possible basic feasible solutions, and no basis 
repeats because of the strictly decreasing objective, the algorithm must reach a basis 
satisfying one of the two terminating conditions. 


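Example 1. Maximize $3x_1 + x_2 + 3x_3$ subject to 

$$
\begin{aligned}
2x_1 + x_2 + x_3 &\le 2 \\
x_1 + 2x_2 + 3x_3 &\le 5 \\
2x_1 + 2x_2 + x_3 &\le 6 \\
x_1 \ge 0, \quad x_2 \ge 0, \quad x_3 &\ge 0.
\end{aligned}
$$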


To transform the problem into standard form so that the simplex procedure 
can be applied, we change the maximization to minimization by multiplying the 
objective function by minus one, and introduce three nonnegative slack variables 
$x_4, x_5, x_6$. We then have the initial tableau 


          a1    a2    a3    a4    a5    a6     b
         (2)   (1)    1     1     0     0      2
          1     2    (3)    0     1     0      5
          2     2     1     0     0     1      6
   r^T   -3    -1    -3     0     0     0      0


First tableau (circled pivot elements are shown in parentheses) 


The problem is already in canonical form with the three slack variables serving 
as the basic variables. We have at this point $r_j = c_j - z_j = c_j$, since the costs of 
the slacks are zero. Application of the criterion for selecting a column in which to 
pivot shows that any of the first three columns would yield an improved solution. 
In each of these columns the appropriate pivot element is determined by computing 
the ratios $y_{i0}/y_{ij}$ and selecting the smallest positive one. The three allowable pivots 
are all circled on the tableau. It is only necessary to determine one allowable pivot, 
and normally we would not bother to calculate them all. For hand calculation on 
problems of this size, however, we may wish to examine the allowable pivots and 
select one that will minimize (at least in the short run) the amount of division 
required. Thus for this example we select the pivot (1) in the second column. 


          2     1     1     1     0     0      2
         -3     0    (1)   -2     1     0      1
         -2     0    -1    -2     0     1      2
         -1     0    -2     1     0     0      2


Second tableau 

We note that the objective function (we are using the negative of the original 
one) has decreased from zero to minus two. Again we pivot on (1). 


         (5)    1     0     3    -1     0      1
         -3     0     1    -2     1     0      1
         -5     0     0    -4     1     1      3
         -7     0     0    -3     2     0      4


Third tableau 


The value of the objective function has now decreased to minus four and we may 
pivot in either the first or fourth column. We select (5). 


          1    1/5    0    3/5  -1/5    0     1/5
          0    3/5    1   -1/5   2/5    0     8/5
          0     1     0    -1     0     1      4
          0    7/5    0    6/5   3/5    0    27/5


Fourth tableau 


Since the last row has no negative elements, we conclude that the solution 
corresponding to the fourth tableau is optimal. Thus $x_1 = 1/5$, $x_2 = 0$, $x_3 = 8/5$, $x_4 = 0$, 
$x_5 = 0$, $x_6 = 4$ is the optimal solution with a corresponding value of the (negative) 
objective of $-(27/5)$. 


Degeneracy 


It is possible that in the course of the simplex procedure, degenerate basic feasible 
solutions may occur. Often they can be handled as a nondegenerate basic feasible 
solution. However, it is possible that after a new column q is selected to enter the 
basis, the minimum of the ratios $y_{i0}/y_{iq}$ may be zero, implying that the zero-valued 
basic variable is the one to go out. This means that the new variable $x_q$ will come 
in at zero value, the objective will not decrease, and the new basic feasible solution 
will also be degenerate. Conceivably, this process could continue for a series of 
steps until, finally, the original degenerate solution is again obtained. The result is 
a cycle that could be repeated indefinitely. 

Methods have been developed to avoid such cycles (see Exercises 15-17 
for a full discussion of one of them, which is based on perturbing the problem 
slightly so that zero-valued variables are actually small positive values, and 
Exercise 32 for Bland’s rule, which is simpler). In practice, however, such proce- 
dures are found to be unnecessary. When degenerate solutions are encountered, the 
simplex procedure generally does not enter a cycle. However, anticycling proce- 
dures are simple, and many codes incorporate such a procedure for the sake of 
safety. 
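
As an illustration of how simple such a safeguard can be, the sketch below selects a pivot by Bland's rule as it is commonly stated: the entering variable is the lowest-indexed one with a negative relative cost, and ties in the ratio test are broken in favor of the basic variable of lowest index. The function name `bland_pivot` and the tableau layout (last row = relative costs, last column = b) are assumptions of this sketch, not notation from the text.

```python
from fractions import Fraction

def bland_pivot(T, basis):
    """Choose a pivot (p, q) by Bland's rule.

    Returns None if no relative cost is negative (current solution optimal),
    or (None, q) if column q has no positive entry (problem unbounded)."""
    m, n = len(T) - 1, len(T[0]) - 1
    # Entering column: lowest index j with r_j < 0.
    q = next((j for j in range(n) if T[-1][j] < 0), None)
    if q is None:
        return None
    rows = [i for i in range(m) if T[i][q] > 0]
    if not rows:
        return (None, q)
    best = min(Fraction(T[i][n]) / T[i][q] for i in rows)
    # Leaving row: among rows attaining the minimum ratio, the one whose
    # basic variable has the lowest index.
    ties = [i for i in rows if Fraction(T[i][n]) / T[i][q] == best]
    p = min(ties, key=lambda i: basis[i])
    return (p, q)
```

In a degenerate tableau where two rows tie at ratio zero, this rule picks the row whose basic variable carries the smaller index, regardless of the order in which the rows happen to be stored.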


3.5 ARTIFICIAL VARIABLES 


A basic feasible solution is sometimes immediately available for linear programs. 
For example, in problems with constraints of the form 


$$
\begin{aligned}
Ax &\le b \\
x &\ge 0
\end{aligned} \qquad (26)
$$


with $b \ge 0$, a basic feasible solution to the corresponding standard form of the 
problem is provided by the slack variables. This provides a means for initiating the 
simplex procedure. The example in the last section was of this type. An initial basic 
feasible solution is not always apparent for other types of linear programs, however, 
and it is necessary to develop a means for determining one so that the simplex 
method can be initiated. Interestingly (and fortunately), an auxiliary linear program 
and corresponding application of the simplex method can be used to determine the 
required initial solution. 

By elementary straightforward operations the constraints of a linear 
programming problem can always be expressed in the form 

$$
\begin{aligned}
Ax &= b \\
x &\ge 0
\end{aligned} \qquad (27)
$$

with $b \ge 0$. In order to find a solution to (27) consider the (artificial) minimization 
problem 

$$
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{m} y_i \\
\text{subject to} \quad & Ax + y = b \\
& x \ge 0 \\
& y \ge 0,
\end{aligned} \qquad (28)
$$

where $y = (y_1, y_2, \ldots, y_m)$ is a vector of artificial variables. If there is a feasible 
solution to (27), then it is clear that (28) has a minimum value of zero with $y = 0$. If 
(27) has no feasible solution, then the minimum value of (28) is greater than zero. 

Now (28) is itself a linear program in the variables x, y, and the system is 
already in canonical form with basic feasible solution y = b. If (28) is solved using 
the simplex technique, a basic feasible solution is obtained at each step. If the 
minimum value of (28) is zero, then the final basic solution will have all $y_i = 0$, 
and hence barring degeneracy, the final solution will have no $y_i$ variables basic. 
If in the final solution some $y_i$ are both zero and basic, indicating a degenerate 
solution, these basic variables can be exchanged for nonbasic $x_i$ variables (again at 
zero level) to yield a basic feasible solution involving x variables only. (However, 
the situation is more complex if A is not of full rank. See Exercise 21.) 
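
The construction of (28) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `phase_one` and the tableau layout are invented for the example, the entering column is the most negative relative cost, and the degenerate case in which an artificial variable remains basic at zero level is not handled.

```python
from fractions import Fraction

def pivot(T, p, q):
    """Pivot the tableau T in place on row p, column q."""
    piv = T[p][q]
    T[p] = [v / piv for v in T[p]]
    for i in range(len(T)):
        if i != p and T[i][q] != 0:
            f = T[i][q]
            T[i] = [a - f * t for a, t in zip(T[i], T[p])]

def phase_one(A, b):
    """Minimize the sum of artificial variables for Ax = b, x >= 0, b >= 0.
    Returns (minimum value, x, basis)."""
    m, n = len(A), len(A[0])
    # Rows [A | I | b]: artificial variables occupy columns n .. n+m-1.
    T = [[Fraction(v) for v in A[i]]
         + [Fraction(int(i == k)) for k in range(m)]
         + [Fraction(b[i])] for i in range(m)]
    # Cost row of (28): zero on each x_j, one on each y_i.
    cost = [Fraction(0)] * n + [Fraction(1)] * m + [Fraction(0)]
    # Reduce the cost row to zero under the basic (artificial) columns.
    for i in range(m):
        cost = [c - t for c, t in zip(cost, T[i])]
    T.append(cost)
    basis = list(range(n, n + m))
    while any(rj < 0 for rj in T[-1][:n + m]):
        q = min(range(n + m), key=lambda j: T[-1][j])
        # The objective of (28) is bounded below by zero, so some entry
        # of column q must be positive and the ratio test is well defined.
        rows = [i for i in range(m) if T[i][q] > 0]
        p = min(rows, key=lambda i: T[i][-1] / T[i][q])
        pivot(T, p, q)
        basis[p] = q
    x = [Fraction(0)] * n
    for i, j in enumerate(basis):
        if j < n:
            x[j] = T[i][-1]
    # The lower right-hand corner holds the negative of the objective value.
    return -T[-1][-1], x, basis
```

A returned value of zero signals that (27) is feasible and that `x` is a basic feasible solution (provided no artificial variable is left in `basis`); a positive value signals infeasibility.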

Example 1. Find a basic feasible solution to 

$$
\begin{aligned}
2x_1 + x_2 + 2x_3 &= 4 \\
3x_1 + 3x_2 + x_3 &= 3 \\
x_1 \ge 0, \quad x_2 \ge 0, \quad x_3 &\ge 0.
\end{aligned}
$$


We introduce artificial variables $x_4 \ge 0$, $x_5 \ge 0$ and an objective function $x_4 + x_5$. 
The initial tableau is 


          x1    x2    x3    x4    x5     b
          2     1     2     1     0      4
          3     3     1     0     1      3
   c^T    0     0     0     1     1      0


Initial tableau 


A basic feasible solution to the expanded system is given by the artificial variables. 
To initiate the simplex procedure we must update the last row so that it has zero 
components under the basic variables. This yields: 


          2     1     2     1     0      4
         (3)    3     1     0     1      3
   r^T   -5    -4    -3     0     0     -7


First tableau 


Pivoting in the column having the most negative bottom row component as 
indicated, we obtain: 

          0    -1   (4/3)   1   -2/3     2
          1     1    1/3    0    1/3     1
          0     1   -4/3    0    5/3    -2

Second tableau 


In the second tableau there is only one choice for pivot, and it leads to the final 
tableau shown. 


          0   -3/4    1    3/4  -1/2    3/2
          1    5/4    0   -1/4   1/2    1/2
          0     0     0     1     1      0


Final tableau 


Both of the artificial variables have been driven out of the basis, thus reducing the 
value of the objective function to zero and leading to the basic feasible solution to 
the original problem 


$$x_1 = 1/2, \quad x_2 = 0, \quad x_3 = 3/2.$$

Using artificial variables, we attack a general linear programming problem by 
use of the two-phase method. This method consists simply of a phase I in which 
artificial variables are introduced as above and a basic feasible solution is found 
(or it is determined that no feasible solutions exist); and a phase II in which, using 
the basic feasible solution resulting from phase I, the original objective function 
is minimized. During phase II the artificial variables and the objective function of 
phase I are omitted. Of course, in phase I artificial variables need be introduced 
only in those equations that do not contain slack variables. 


Example 2. Consider the problem 


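$$
\begin{aligned}
\text{minimize} \quad & 4x_1 + x_2 + x_3 \\
\text{subject to} \quad & 2x_1 + x_2 + 2x_3 = 4 \\
& 3x_1 + 3x_2 + x_3 = 3 \\
& x_1 \ge 0, \quad x_2 \ge 0, \quad x_3 \ge 0.
\end{aligned}
$$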


There is no basic feasible solution apparent, so we use the two-phase method. The 
first phase was done in Example 1 for these constraints, so we shall not repeat it 
here. We give only the final tableau with the columns corresponding to the artificial 
variables deleted, since they are not used in phase II. We use the new cost function 
in place of the old one. Temporarily writing $c^T$ in the bottom row we have 


          x1    x2    x3     b
          0   -3/4    1     3/2
          1    5/4    0     1/2
   c^T    4     1     1      0


Initial tableau 


Transforming the last row so that zeros appear in the basic columns, we have 


          0   -3/4    1     3/2
          1   (5/4)   0     1/2
          0  -13/4    0    -7/2

First tableau 

         3/5    0     1     9/5
         4/5    1     0     2/5
        13/5    0     0   -11/5


Second tableau 


and hence the optimal solution is $x_1 = 0$, $x_2 = 2/5$, $x_3 = 9/5$. 

Example 3. (A free variable problem). 

$$
\begin{aligned}
\text{minimize} \quad & -2x_1 + 4x_2 + 7x_3 + x_4 + 5x_5 \\
\text{subject to} \quad & -x_1 + x_2 + 2x_3 + x_4 + 2x_5 = 7 \\
& -x_1 + 2x_2 + 3x_3 + x_4 + x_5 = 6 \\
& -x_1 + x_2 + x_3 + 2x_4 + x_5 = 4 \\
& x_1 \text{ free}, \quad x_2 \ge 0, \quad x_3 \ge 0, \quad x_4 \ge 0, \quad x_5 \ge 0.
\end{aligned}
$$


Since $x_1$ is free, it can be eliminated, as described in Chapter 2, by solving for 
$x_1$ in terms of the other variables from the first equation and substituting everywhere 
else. This can all be done with the simplex tableau as follows: 


          x1    x2    x3    x4    x5     b
        (-1)    1     2     1     2      7
         -1     2     3     1     1      6
         -1     1     1     2     1      4
   c^T   -2     4     7     1     5      0


Initial tableau 


We select any nonzero element in the first column to pivot on; this will eliminate 
$x_1$. 


          x1 |  x2    x3    x4    x5     b
          1  |  -1    -2    -1    -2    -7
         ----+-----------------------------
          0  |   1     1     0    -1    -1
          0  |   0    -1     1    -1    -3
          0  |   2     3    -1     1   -14


Equivalent problem 


We now save the first row for future reference, but our linear program only 
involves the sub-tableau indicated. There is no obvious basic feasible solution for 
this problem, so we introduce artificial variables $x_6$ and $x_7$. 


          x2    x3    x4    x5    x6    x7     b
         -1    -1     0     1     1     0      1
          0     1    -1     1     0     1      3
   c^T    0     0     0     0     1     1      0

Initial tableau for phase I 

Transforming the last row appropriately we obtain 

          x2    x3    x4    x5    x6    x7     b
         -1    -1     0    (1)    1     0      1
          0     1    -1     1     0     1      3
   r^T    1     0     1    -2     0     0     -4


First tableau—phase I 

          x2    x3    x4    x5    x6    x7     b
         -1    -1     0     1     1     0      1
         (1)    2    -1     0    -1     1      2
         -1    -2     1     0     2     0     -2
Second tableau—phase I 

          0     1    -1     1     0     1      3
          1     2    -1     0    -1     1      2
          0     0     0     0     1     1      0


Final tableau—phase I 


Now we go back to the equivalent reduced problem 


          x2    x3    x4    x5     b
          0     1    -1     1      3
          1     2    -1     0      2
   c^T    2     3    -1     1    -14


Initial tableau—phase IT 


Transforming the last row appropriately we proceed with: 


          0     1    -1     1      3
          1    (2)   -1     0      2
          0    -2     2     0    -21


First tableau—phase IT 


        -1/2    0   -1/2    1      2
         1/2    1   -1/2    0      1
          1     0     1     0    -19


Final tableau—phase II 

The solution $x_3 = 1$, $x_5 = 2$ can be inserted in the expression for $x_1$ giving 
$$x_1 = -7 + 2 \cdot 1 + 2 \cdot 2 = -1;$$
thus the final solution is 

$$x_1 = -1, \quad x_2 = 0, \quad x_3 = 1, \quad x_4 = 0, \quad x_5 = 2.$$


3.6 MATRIX FORM OF THE SIMPLEX METHOD 


Although the elementary pivot transformations associated with the simplex method 
are in many respects most easily discernible in the tableau format, with attention 
focused on the individual elements, there is much insight to be gained by studying 


a matrix interpretation of the procedure. The vector—matrix relationships that exist 
between the various rows and columns of the tableau lead, however, not only to 
increased understanding but also, in a rather direct way, to the revised simplex 
procedure which in many cases can result in considerable computational advantage. 
The matrix formulation is also a natural setting for the discussion of dual linear 
programs and other topics related to linear programming. 

A preliminary observation in the development is that the tableau at any point in 
the simplex procedure can be determined solely by a knowledge of which variables 
are basic. As before we denote by B the submatrix of the original A matrix consisting 
of the m columns of A corresponding to the basic variables. These columns are 
linearly independent and hence the columns of B form a basis for $E^m$. We refer to 
B as the basis matrix. 

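As usual, let us assume that B consists of the first m columns of A. Then by 
partitioning A, x, and $c^T$ as 

$$A = [B, D], \qquad x = (x_B, x_D), \qquad c^T = [c_B^T, c_D^T],$$

the standard linear program becomes 

$$
\begin{aligned}
\text{minimize} \quad & c_B^T x_B + c_D^T x_D \\
\text{subject to} \quad & B x_B + D x_D = b \\
& x_B \ge 0, \quad x_D \ge 0.
\end{aligned} \qquad (29)
$$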


The basic solution, which we assume is also feasible, corresponding to the 
basis B is $x = (x_B, 0)$ where $x_B = B^{-1}b$. The basic solution results from setting 
$x_D = 0$. However, for any value of $x_D$ the necessary value of $x_B$ can be computed 
from (29) as 

$$x_B = B^{-1}b - B^{-1}D x_D, \qquad (30)$$

and this general expression when substituted in the cost function yields 

$$
\begin{aligned}
z &= c_B^T (B^{-1}b - B^{-1}D x_D) + c_D^T x_D \\
  &= c_B^T B^{-1} b + (c_D^T - c_B^T B^{-1} D) x_D,
\end{aligned} \qquad (31)
$$

which expresses the cost of any solution to (29) in terms of $x_D$. Thus 

$$r_D^T = c_D^T - c_B^T B^{-1} D \qquad (32)$$

is the relative cost vector (for nonbasic variables). It is the components of this 
vector that are used to determine which vector to bring into the basis. 
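
Formula (32) can be evaluated without forming $B^{-1}$ by first solving $B^T \lambda = c_B$ and then computing $c_j - \lambda^T a_j$ for each nonbasic column. The following sketch does this with exact fractions; the helper names `solve` and `relative_costs` and the 0-based column indexing are assumptions of the sketch.

```python
from fractions import Fraction

def solve(M, rhs):
    """Solve M z = rhs by Gauss-Jordan elimination with exact fractions
    (assumes M is square and nonsingular)."""
    n = len(M)
    aug = [[Fraction(v) for v in M[i]] + [Fraction(rhs[i])] for i in range(n)]
    for col in range(n):
        # Bring a row with a nonzero entry in this column into position.
        piv = next(i for i in range(col, n) if aug[i][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        d = aug[col][col]
        aug[col] = [v / d for v in aug[col]]
        for i in range(n):
            if i != col and aug[i][col] != 0:
                f = aug[i][col]
                aug[i] = [a - f * t for a, t in zip(aug[i], aug[col])]
    return [aug[i][n] for i in range(n)]

def relative_costs(A, c, basic):
    """Return {j: r_j} for the nonbasic columns j, following (32)."""
    m = len(A)
    # Row k of B^T is the k-th basic column of A.
    B_T = [[A[i][j] for i in range(m)] for j in basic]
    lam = solve(B_T, [c[j] for j in basic])          # B^T lam = c_B
    return {j: c[j] - sum(lam[i] * A[i][j] for i in range(m))
            for j in range(len(A[0])) if j not in basic}

# Data of Example 1, Section 3.4, with x2, x5, x6 basic (indices 1, 4, 5).
A = [[2, 1, 1, 1, 0, 0],
     [1, 2, 3, 0, 1, 0],
     [2, 2, 1, 0, 0, 1]]
c = [-3, -1, -3, 0, 0, 0]
r = relative_costs(A, c, [1, 4, 5])
```

For this basis the computed values agree with the last row of the second tableau of that example: the relative costs of $x_1$, $x_3$, $x_4$ are $-1$, $-2$, and $1$.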

Having derived the vector expression for the relative cost it is now possible to 
write the simplex tableau in matrix form. The initial tableau takes the form 

$$
\begin{bmatrix} A & b \\ c^T & 0 \end{bmatrix}
= \begin{bmatrix} B & D & b \\ c_B^T & c_D^T & 0 \end{bmatrix}, \qquad (33)
$$

which is not in general in canonical form and does not correspond to a point in 
the simplex procedure. If the matrix B is used as a basis, then the corresponding 
tableau becomes 

$$
\begin{bmatrix} I & B^{-1}D & B^{-1}b \\ 0 & c_D^T - c_B^T B^{-1} D & -c_B^T B^{-1} b \end{bmatrix}, \qquad (34)
$$

which is the matrix form we desire. 


3.7 THE REVISED SIMPLEX METHOD 


Extensive experience with the simplex procedure applied to problems from various 
fields, and having various values of n and m, has indicated that the method can be 
expected to converge to an optimum solution in about m, or perhaps 3m/2, pivot 
operations. (Except in the worst case. See Chapter 5.) Thus, particularly if m is 
much smaller than n, that is, if the matrix A has far fewer rows than columns, 
pivots will occur in only a small fraction of the columns during the course of 
optimization. 

Since the other columns are not explicitly used, it appears that the work 
expended in calculating the elements in these columns after each pivot is, in some 
sense, wasted effort. The revised simplex method is a scheme for ordering the 
computations required of the simplex method so that unnecessary calculations are 
avoided. In fact, even if pivoting is eventually required in all columns, but m is 
small compared to n, the revised simplex method can frequently save computational 
effort. 

The revised form of the simplex method is this: Given the inverse $B^{-1}$ of a 
current basis, and the current solution $x_B = y_0 = B^{-1}b$, 


Step 1. Calculate the current relative cost coefficients $r_D^T = c_D^T - c_B^T B^{-1} D$. This can 
best be done by first calculating $\lambda^T = c_B^T B^{-1}$ and then the relative cost vector 
$r_D^T = c_D^T - \lambda^T D$. If $r_D \ge 0$ stop; the current solution is optimal. 


Step 2. Determine which vector $a_q$ is to enter the basis by selecting the most 
negative cost coefficient; and calculate $y_q = B^{-1} a_q$ which gives the vector $a_q$ 
expressed in terms of the current basis. 


Step 3. If no $y_{iq} > 0$, stop; the problem is unbounded. Otherwise, calculate the 
ratios $y_{i0}/y_{iq}$ for $y_{iq} > 0$ to determine which vector is to leave the basis. 


Step 4. Update $B^{-1}$ and the current solution $B^{-1}b$. Return to Step 1. 


Updating of $B^{-1}$ is accomplished by the usual pivot operations applied to an 
array consisting of $B^{-1}$ and $y_q$, where the pivot is the appropriate element in $y_q$. Of 
course $B^{-1}b$ may be updated at the same time by adjoining it as another column. 
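
The four steps, together with the pivot update of the array $[B^{-1}, x_B, y_q]$, can be sketched as below. This is a minimal illustration that assumes the starting basis matrix is the identity (for example, slack columns); the function name `revised_simplex`, the 0-based indexing, and the use of exact fractions are choices made for the sketch.

```python
from fractions import Fraction

def revised_simplex(A, b, c, basic):
    """Revised simplex with an explicit inverse, started from a basis whose
    matrix is the identity. Returns (x, cost); raises if unbounded."""
    m, n = len(A), len(A[0])
    Binv = [[Fraction(int(i == k)) for k in range(m)] for i in range(m)]
    xB = [Fraction(v) for v in b]
    basic = list(basic)
    while True:
        # Step 1: lam^T = c_B^T Binv, then r_j = c_j - lam^T a_j.
        lam = [sum(c[basic[i]] * Binv[i][k] for i in range(m)) for k in range(m)]
        r = {j: c[j] - sum(lam[i] * A[i][j] for i in range(m))
             for j in range(n) if j not in basic}
        if all(v >= 0 for v in r.values()):
            break
        # Step 2: entering column q and its representation y = Binv a_q.
        q = min(r, key=r.get)
        y = [sum(Binv[i][k] * A[k][q] for k in range(m)) for i in range(m)]
        # Step 3: ratio test over the positive components of y.
        rows = [i for i in range(m) if y[i] > 0]
        if not rows:
            raise ValueError("problem is unbounded")
        p = min(rows, key=lambda i: xB[i] / y[i])
        # Step 4: pivot on y[p] in the array [Binv | xB | y].
        piv = y[p]
        Binv[p] = [v / piv for v in Binv[p]]
        xB[p] = xB[p] / piv
        for i in range(m):
            if i != p and y[i] != 0:
                Binv[i] = [a - y[i] * t for a, t in zip(Binv[i], Binv[p])]
                xB[i] = xB[i] - y[i] * xB[p]
        basic[p] = q
    x = [Fraction(0)] * n
    for i, j in enumerate(basic):
        x[j] = xB[i]
    return x, sum(c[j] * x[j] for j in range(n))
```

Note that only one column of A is transformed per iteration (the entering column), which is the source of the savings described above when m is much smaller than n.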

To begin the procedure one requires, as always, an initial basic feasible solution 
and, in this case, the inverse of the initial basis. In most problems the initial basis 
(and hence also its inverse) is an identity matrix, resulting either from slack or 
surplus variables or from artificial variables. The inverse of any initial basis can, 
however, be explicitly calculated in order to initiate the revised simplex procedure. 

To illustrate the method and to indicate how the computations and storage can 
be handled, we consider an example. 


Example 1. We solve again Example 1 of Section 3.4. The vectors are listed once 
for reference 


          a1    a2    a3    a4    a5    a6     b
          2     1     1     1     0     0      2
          1     2     3     0     1     0      5
          2     2     1     0     0     1      6


and the objective function is determined by 
$$c^T = [-3, -1, -3, 0, 0, 0].$$


We start with an initial basic feasible solution and corresponding $B^{-1}$ as shown 
in the tableau below 


   Variable |      B^{-1}      |  x_B
       4    |   1    0    0    |   2
       5    |   0    1    0    |   5
       6    |   0    0    1    |   6


We compute 
$$\lambda^T = [0, 0, 0] B^{-1} = [0, 0, 0]$$
and then 

$$r_D^T = c_D^T - \lambda^T D = [-3, -1, -3].$$

We decide to bring $a_2$ into the basis (violating the rule of selecting the most negative 
relative cost in order to simplify the hand calculation). Its current representation is 
found by multiplying by $B^{-1}$; thus we have 


   Variable |      B^{-1}      |  x_B |  y_2
       4    |   1    0    0    |   2  |  (1)
       5    |   0    1    0    |   5  |   2
       6    |   0    0    1    |   6  |   2


After computing the ratios in the usual manner, we select the pivot indicated (shown in 
parentheses). The updated tableau becomes 


   Variable |      B^{-1}      |  x_B
       2    |   1    0    0    |   2
       5    |  -2    1    0    |   1
       6    |  -2    0    1    |   2


then 

$$\lambda^T = [-1, 0, 0] B^{-1} = [-1, 0, 0]$$

$$r_1 = -1, \quad r_3 = -2, \quad r_4 = 1.$$


We select $a_3$ to enter. We have the tableau 


   Variable |      B^{-1}      |  x_B |  y_3
       2    |   1    0    0    |   2  |   1
       5    |  -2    1    0    |   1  |  (1)
       6    |  -2    0    1    |   2  |  -1


Using the pivot indicated we obtain 


   Variable |      B^{-1}      |  x_B
       2    |   3   -1    0    |   1
       3    |  -2    1    0    |   1
       6    |  -4    1    1    |   3

Now 

$$\lambda^T = [-1, -3, 0] B^{-1} = [3, -2, 0],$$
and 
$$r_1 = -7, \quad r_4 = -3, \quad r_5 = 2.$$

We select $a_1$ to enter the basis. We have the tableau 


   Variable |      B^{-1}      |  x_B |  y_1
       2    |   3   -1    0    |   1  |  (5)
       3    |  -2    1    0    |   1  |  -3
       6    |  -4    1    1    |   3  |  -5


Using the pivot indicated we obtain 

   Variable |      B^{-1}      |  x_B
       1    |  3/5  -1/5   0   |  1/5
       3    | -1/5   2/5   0   |  8/5
       6    |  -1     0    1   |   4


$$\lambda^T = [-3, -3, 0] B^{-1} = [-6/5, -3/5, 0],$$
and 
$$r_2 = 7/5, \quad r_4 = 6/5, \quad r_5 = 3/5.$$


Since the $r_j$'s are all nonnegative, we conclude that the solution $x = (1/5, 0, 
8/5, 0, 0, 4)$ is optimal. 


*3.8 THE SIMPLEX METHOD AND LU 
DECOMPOSITION 


We may go one step further in the matrix interpretation of the simplex method and 
note that execution of a single simplex cycle is not explicitly dependent on having 
$B^{-1}$ but rather on the ability to solve linear systems with B as the coefficient 
matrix. Thus, the revised simplex method stated at the beginning of Section 3.7 
can be restated as: Given the current basis B, 


Step 1. Calculate the current solution $x_B = y_0$ satisfying $B y_0 = b$. 


Step 2. Solve $\lambda^T B = c_B^T$, and set $r_D^T = c_D^T - \lambda^T D$. If $r_D \ge 0$, stop; the current 
solution is optimal. 


Step 3. Determine which vector $a_q$ is to enter the basis by selecting the most 
negative relative cost coefficient, and solve $B y_q = a_q$. 


Step 4. If no $y_{iq} > 0$, stop; the problem is unbounded. Otherwise, calculate the 
ratios $y_{i0}/y_{iq}$ for $y_{iq} > 0$ and select the smallest nonnegative one to determine 
which vector is to leave the basis. 


Step 5. Update B. Return to Step 1. 

In this form it is apparent that there is no explicit need for having $B^{-1}$, but 
rather it is only necessary to solve three systems of equations, two involving 
the matrix B and one (the one for $\lambda$) involving $B^T$. In previous sections these 
three equations were solved, as the method progressed, by the pivoting operations. 
From the viewpoints of efficiency and numerical stability, however, this pivoting 
procedure is not as effective as the method of Gaussian elimination for general 
systems of linear equations (see Appendix C), and it therefore seems appropriate to 
investigate the possibility of adapting the numerically superior method of Gaussian 
elimination to the simplex method. The result is a version of the revised simplex 
method that possesses better numerical stability than other methods, and which for 
large-scale problems can offer tremendous storage advantages. 

We concentrate on the problem of solving the linear systems 


$$B y_0 = b, \qquad \lambda^T B = c_B^T, \qquad B y_q = a_q \qquad (35)$$


that are required by a single step of the simplex method. Suppose B has been 
decomposed into the form $B = LU$ where L is a lower triangular matrix and U is 
an upper triangular matrix.¹ Then each of the linear systems (35) can be solved by 
solving two triangular systems. Since solving in this fashion is simple, knowledge 
of L and U is as good as knowledge of $B^{-1}$. 
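
The sketch below illustrates how the systems in (35) reduce to triangular solves once $B = LU$ is available. It is a minimal sketch under the same simplifying assumption as the text (no row interchanges, so every pivot must be nonzero); the function names are invented for the illustration and exact fractions are used to keep the arithmetic transparent.

```python
from fractions import Fraction

def lu_decompose(B):
    """Doolittle factorization B = L U without row interchanges
    (assumes every pivot U[k][k] is nonzero)."""
    m = len(B)
    L = [[Fraction(int(i == j)) for j in range(m)] for i in range(m)]
    U = [[Fraction(v) for v in row] for row in B]
    for k in range(m):
        for i in range(k + 1, m):
            L[i][k] = U[i][k] / U[k][k]
            U[i] = [a - L[i][k] * t for a, t in zip(U[i], U[k])]
    return L, U

def forward(L, rhs):
    """Solve L z = rhs for lower triangular L."""
    z = []
    for i in range(len(L)):
        z.append((rhs[i] - sum(L[i][j] * z[j] for j in range(i))) / L[i][i])
    return z

def backward(U, rhs):
    """Solve U z = rhs for upper triangular U."""
    m = len(U)
    z = [Fraction(0)] * m
    for i in reversed(range(m)):
        z[i] = (rhs[i] - sum(U[i][j] * z[j] for j in range(i + 1, m))) / U[i][i]
    return z

def transpose(M):
    return [list(col) for col in zip(*M)]

def solve_B(L, U, rhs):
    """Solve B y = rhs via L w = rhs followed by U y = w."""
    return backward(U, forward(L, rhs))

def solve_BT(L, U, cB):
    """Solve lam^T B = c_B^T, that is, U^T (L^T lam) = c_B."""
    w = forward(transpose(U), cB)      # U^T is lower triangular
    return backward(transpose(L), w)   # L^T is upper triangular

# Final basis [a1, a3, a6] of Example 1, Section 3.7.
B = [[2, 1, 0], [1, 3, 0], [2, 1, 1]]
L, U = lu_decompose(B)
y0 = solve_B(L, U, [2, 5, 6])
lam = solve_BT(L, U, [-3, -3, 0])
```

For this basis the two solves reproduce the values found earlier with the explicit inverse, namely $x_B = (1/5, 8/5, 4)$ and $\lambda^T = [-6/5, -3/5, 0]$.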

Next, we show how the LU decomposition of B can be updated when a single 
basis vector is changed. At the beginning of the simplex cycle suppose B has 
the form 


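$$B = [a_1, a_2, \ldots, a_m].$$


At the end of the cycle we have the new basis 


$$\bar{B} = [a_1, a_2, \ldots, a_{k-1}, a_{k+1}, \ldots, a_m, a_q],$$


where it should be noted that when $a_k$ is dropped all subsequent vectors are shifted 
to the left, and the new vector $a_q$ is appended on the right. This procedure leads to 
a fairly simple updating technique. 

We have 

$$
\begin{aligned}
L^{-1}\bar{B} &= [L^{-1}a_1, L^{-1}a_2, \ldots, L^{-1}a_{k-1}, L^{-1}a_{k+1}, \ldots, L^{-1}a_m, L^{-1}a_q] \\
&= [u_1, u_2, \ldots, u_{k-1}, u_{k+1}, \ldots, u_m, L^{-1}a_q] = \bar{H},
\end{aligned}
$$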


¹For simplicity, we are assuming that no row interchanges are required to produce the LU 
decomposition. This assumption can be relaxed, but both the notation and the method itself 
become somewhat more complex. In practice row interchanges are introduced to preserve 
accuracy or sparsity. 

where the $u_i$'s are the columns of U. The matrix $\bar{H}$ takes the form 


[Schematic of $\bar{H}$: upper triangular in its first k − 1 columns, with one additional subdiagonal entry in each of the remaining columns] 


with zeros below the main diagonal in the first k − 1 columns, and zeros below the 
element immediately under the diagonal in all other columns. The matrix $\bar{H}$ itself 
can be constructed without additional computation, since the $u_i$'s are known and 
$L^{-1}a_q$ is a by-product in the computation of $y_q$. 


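$\bar{H}$ can be reduced to upper triangular form by using Gaussian elimination to 
zero out the subdiagonal elements. Thus the upper triangular matrix $\bar{U}$ can be 
obtained from $\bar{H}$ by application of a series of transformations, each having the form 

$$
M_i = \begin{bmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & m_i & 1 & & \\
 & & & & \ddots & \\
 & & & & & 1
\end{bmatrix} \qquad (36)
$$

for $i = k, k+1, \ldots, m-1$, where the single off-diagonal entry $m_i$ lies in row i + 1, 
column i. The matrix $\bar{U}$ becomes 

$$\bar{U} = M_{m-1} M_{m-2} \cdots M_k \bar{H}. \qquad (37)$$

We then have 

$$\bar{B} = L\bar{H} = L M_k^{-1} M_{k+1}^{-1} \cdots M_{m-1}^{-1} \bar{U}, \qquad (38)$$

and thus evaluating 

$$\bar{L} = L M_k^{-1} \cdots M_{m-1}^{-1}, \qquad (39)$$

we obtain the decomposition 

$$\bar{B} = \bar{L}\bar{U}. \qquad (40)$$

Since $M_i^{-1}$ is simply $M_i$ with the sign of the off-diagonal term reversed, evaluation 
of $\bar{L}$ is straightforward. 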
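
A minimal sketch of this update is given below, under the same no-row-interchange assumption (each diagonal entry met during the elimination must be nonzero). The function name `update_lu` and the 0-based column index `k` are assumptions of the sketch; the multiplier `f` used in the code corresponds to the negative of the off-diagonal entry of the elimination matrix.

```python
from fractions import Fraction

def forward(L, rhs):
    """Solve L z = rhs for lower triangular L."""
    z = []
    for i in range(len(L)):
        z.append((Fraction(rhs[i]) - sum(L[i][j] * z[j] for j in range(i))) / L[i][i])
    return z

def update_lu(L, U, k, a_q):
    """Given B = L U, drop basis column k (0-based), shift the later columns
    left, append a_q on the right, and return (Lbar, Ubar) for the new basis."""
    m = len(U)
    w = forward(L, a_q)                       # w = L^{-1} a_q
    # H: the columns of U other than column k, followed by w.
    H = [[Fraction(U[i][j]) for j in range(m) if j != k] + [w[i]]
         for i in range(m)]
    Lbar = [[Fraction(v) for v in row] for row in L]
    for i in range(k, m - 1):
        # Zero out the subdiagonal entry H[i+1][i] using row i.
        f = H[i + 1][i] / H[i][i]
        H[i + 1] = [a - f * t for a, t in zip(H[i + 1], H[i])]
        # Multiply Lbar on the right by the inverse elimination matrix:
        # column i of Lbar gains f times column i+1.
        for r in range(m):
            Lbar[r][i] += f * Lbar[r][i + 1]
    return Lbar, H

# Illustrative 3x3 data (not from the text).
L = [[1, 0, 0], [2, 1, 0], [1, 2, 1]]
U = [[2, 1, 1], [0, 1, 3], [0, 0, 2]]
Lbar, Ubar = update_lu(L, U, 0, [1, 0, 1])
```

Only the columns from position k onward are touched, so the cost of the update is small when the leaving column is near the right-hand end of the basis.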

There are numerous variations of this basic idea. The elementary transforma- 
tions (36) can be carried rather than explicitly evaluating L, the LU decomposition 
can be periodically reevaluated, and row and column interchanges can be handled in 
such a way as to maximize stability or minimize the density of the decomposition. 
Some of these extensions are discussed in the references at the end of the chapter. 


3.9 DECOMPOSITION 


Large linear programming problems usually have some special structural form 
that can (and should) be exploited to develop efficient computational procedures. 
One common structure is where there are a number of separate activity areas that 
are linked through common resource constraints. An example is provided by a 
multidivisional firm attempting to minimize the total cost of its operations. The 
divisions of the firm must each meet internal requirements that do not interact with 
the constraints of other divisions; but in addition there are common resources that 
must be shared among divisions and thereby represent linking constraints. 

A problem of this form can be solved by the Dantzig—Wolfe decomposition 
method described in this section. The method is an iterative process where at each 
step a number of separate subproblems are solved. The subproblems are themselves 
linear programs within the separate areas (or within divisions in the example of 
the firm). The objective functions of these subproblems are varied from iteration 
to iteration and are determined by a separate calculation based on the results 
of the previous iteration. This action coordinates the individual subproblems so 
that, ultimately, the solution to the overall problem is solved. The method can be 
derived as a special version of the revised simplex method, where the subproblems 
correspond to evaluation of reduced cost coefficients for the main problem. 

To describe the method we consider the linear program in standard form 


$$
\begin{aligned}
\text{minimize} \quad & c^T x \\
\text{subject to} \quad & Ax = b \\
& x \ge 0.
\end{aligned} \qquad (41)
$$


Suppose, for purposes of this entire section, that the A matrix has the special 
“block-angular” structure: 


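$$
A = \begin{bmatrix}
L_1 & L_2 & \cdots & L_N \\
A_1 & & & \\
 & A_2 & & \\
 & & \ddots & \\
 & & & A_N
\end{bmatrix} \qquad (42)
$$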

By partitioning the vectors x, $c^T$, and b consistent with this partition of A, the 
problem can be rewritten as 

$$
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{N} c_i^T x_i \\
\text{subject to} \quad & \sum_{i=1}^{N} L_i x_i = b_0 \\
& A_i x_i = b_i, \quad i = 1, \ldots, N \\
& x_i \ge 0, \quad i = 1, \ldots, N.
\end{aligned} \qquad (43)
$$

This may be viewed as a problem of minimizing the total cost of N different linear 
programs that are independent except for the first constraint, which is a linking 
constraint of, say, dimension m. 

Each of the subproblems is of the form 


$$
\begin{aligned}
\text{minimize} \quad & c_i^T x_i \\
\text{subject to} \quad & A_i x_i = b_i \\
& x_i \ge 0.
\end{aligned} \qquad (44)
$$

The constraint set for the ith subproblem is $S_i = \{x_i : A_i x_i = b_i, \; x_i \ge 0\}$. As for 
any linear program, this constraint set $S_i$ is a polytope and can be expressed as 
the intersection of a finite number of closed half-spaces. There is no guarantee 
that each $S_i$ is bounded, even if the original linear program (41) has a bounded 
constraint set. We shall assume for simplicity, however, that each of the polytopes 
$S_i$, $i = 1, \ldots, N$ is indeed bounded and hence is a polyhedron. One may guarantee 
that this assumption is satisfied by placing artificial (large) upper bounds on each 
$x_i$. 

Under the boundedness assumption, each polyhedron $S_i$ consists entirely of 
points that are convex combinations of its extreme points. Thus, if the extreme 
points of $S_i$ are $\{x_{i1}, x_{i2}, \ldots, x_{iK_i}\}$, then any point $x_i \in S_i$ can be expressed in the form 

$$
x_i = \sum_{j=1}^{K_i} \alpha_{ij} x_{ij}, \quad \text{where} \quad \sum_{j=1}^{K_i} \alpha_{ij} = 1 \quad \text{and} \quad \alpha_{ij} \ge 0, \; j = 1, \ldots, K_i. \qquad (45)
$$

The $\alpha_{ij}$'s are the weighting coefficients of the extreme points. 

We now convert the original linear program to an equivalent master problem, of 
which the objective is to find the optimal weighting coefficients for each polyhedron, 
$S_i$. Corresponding to each extreme point $x_{ij}$ in $S_i$, define $p_{ij} = c_i^T x_{ij}$ and $q_{ij} = L_i x_{ij}$. 
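Clearly $p_{ij}$ is the equivalent cost of the extreme point $x_{ij}$, and $q_{ij}$ is its equivalent 
activity vector in the linking constraints. 

Then the original linear program (41) is equivalent, using (45), to the master 
problem: 

$$
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{N} \sum_{j=1}^{K_i} p_{ij} \alpha_{ij} \\
\text{subject to} \quad & \sum_{i=1}^{N} \sum_{j=1}^{K_i} q_{ij} \alpha_{ij} = b_0 \\
& \sum_{j=1}^{K_i} \alpha_{ij} = 1, \quad i = 1, \ldots, N \\
& \alpha_{ij} \ge 0, \quad j = 1, \ldots, K_i, \; i = 1, \ldots, N.
\end{aligned} \qquad (46)
$$

This master problem has variables 

$$\alpha = (\alpha_{11}, \ldots, \alpha_{1K_1}, \alpha_{21}, \ldots, \alpha_{2K_2}, \ldots, \alpha_{N1}, \ldots, \alpha_{NK_N})$$

and can be expressed more compactly as 

$$
\begin{aligned}
\text{minimize} \quad & p^T \alpha \\
\text{subject to} \quad & Q\alpha = g \\
& \alpha \ge 0,
\end{aligned} \qquad (47)
$$

where $g^T = [b_0^T, 1, 1, \ldots, 1]$; the element of p associated with $\alpha_{ij}$ is $p_{ij}$; and the 
column of Q associated with $\alpha_{ij}$ is 

$$\begin{bmatrix} q_{ij} \\ e_i \end{bmatrix},$$

with $e_i$ denoting the ith unit vector in $E^N$. 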

Suppose that at some stage of the revised simplex method for the master 
problem we know the basis B and corresponding simplex multipliers $\lambda^T = p_B^T B^{-1}$. 
The corresponding relative cost vector is $r_D^T = c_D^T - \lambda^T D$, having components 

$$r_{ij} = p_{ij} - \lambda^T \begin{bmatrix} q_{ij} \\ e_i \end{bmatrix}. \qquad (48)$$

It is not necessary to calculate all the $r_{ij}$'s; it is only necessary to determine the 
minimal $r_{ij}$. If the minimal value is nonnegative, the current solution is optimal and 
the process terminates. If, on the other hand, the minimal element is negative, the 
corresponding column should enter the basis. 

The search for the minimal element in (48) is normally made with respect 
to nonbasic columns only. The search can be formally extended to include basic 
columns as well, however, since for basic elements 

$$p_{ij} - \lambda^T \begin{bmatrix} q_{ij} \\ e_i \end{bmatrix} = 0.$$

The extra zero values do not influence the subsequent procedure, since a new 
column will enter only if the minimal value is less than zero. 

We therefore define $r^*$ as the minimum relative cost coefficient for all possible 
basis vectors. That is, 

$$
r^* = \min_{i \in \{1, \ldots, N\}} \left\{ r_i^* = \min_{j \in \{1, \ldots, K_i\}} \left\{ p_{ij} - \lambda^T \begin{bmatrix} q_{ij} \\ e_i \end{bmatrix} \right\} \right\}.
$$

Using the definitions of $p_{ij}$ and $q_{ij}$, this becomes 

$$
r_i^* = \min_{j \in \{1, \ldots, K_i\}} \left\{ c_i^T x_{ij} - \lambda_0^T L_i x_{ij} - \lambda_{m+i} \right\}, \qquad (49)
$$

where $\lambda_0$ is the vector made up of the first m elements of $\lambda$, m being the number 
of rows of $L_i$ (the number of linking constraints in (43)). 

The minimization problem in (49) is actually solved by the ith subproblem: 

$$
\begin{aligned}
\text{minimize} \quad & (c_i^T - \lambda_0^T L_i) x_i \\
\text{subject to} \quad & A_i x_i = b_i \\
& x_i \ge 0.
\end{aligned} \qquad (50)
$$

This follows from the fact that $\lambda_{m+i}$ is independent of the extreme point index j 
(since $\lambda$ is fixed during the determination of the $r_i$'s), and that the solution of (50) 
must be that extreme point of $S_i$, say $x_{ik}$, of minimum cost, using the adjusted cost 
coefficients $c_i^T - \lambda_0^T L_i$. 

Thus, an algorithm for this special version of the revised simplex method 
applied to the master problem is the following: Given a basis B 


Step 1. Calculate the current basic solution $x_B$, and solve $\lambda^T B = c_B^T$ for $\lambda$. 

Step 2. For each $i = 1, 2, \ldots, N$, determine the optimal solution $x_i^*$ of the ith 
subproblem (50) and calculate 

$$r_i^* = (c_i^T - \lambda_0^T L_i) x_i^* - \lambda_{m+i}. \qquad (51)$$

If all $r_i^* \ge 0$, stop; the current solution is optimal. 

Step 3. Determine which column is to enter the basis by selecting the minimal $r_i^*$. 

Step 4. Update the basis of the master problem as usual. 
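
The pricing computation of Step 2 can be sketched as follows. To keep the sketch short it assumes that the extreme points of each subsystem are supplied explicitly and simply takes the minimum over that list; in an actual implementation subproblem (50) would be solved as a linear program instead. The function name `price_out`, its argument names, and the two-division data in the usage lines are illustrative assumptions.

```python
from fractions import Fraction

def price_out(c_i, L_i, lam0, lam_conv, extreme_points):
    """Evaluate (51) for one subproblem.

    c_i            cost coefficients of division i
    L_i            linking-constraint block of division i (list of rows)
    lam0           multipliers on the linking constraints
    lam_conv       multiplier on the convexity row of division i
    extreme_points explicit list of extreme points of S_i (an assumption
                   of this sketch; normally found by solving (50))
    Returns (r_i*, minimizing extreme point)."""
    m = len(L_i)
    # Adjusted cost coefficients c_i^T - lam0^T L_i.
    adj = [c_i[k] - sum(lam0[r] * L_i[r][k] for r in range(m))
           for k in range(len(c_i))]
    best = min(extreme_points,
               key=lambda x: sum(a * v for a, v in zip(adj, x)))
    value = sum(a * v for a, v in zip(adj, best)) - lam_conv
    return value, best

# Illustrative two-division data with two linking constraints.
pts1 = [(0, 0), (2, 0), (0, 2)]
pts2 = [(0, 0), (Fraction(5, 3), 0), (1, 1), (0, 2)]
r1 = price_out([-1, -2], [[1, 1], [0, 1]], [0, 0], 0, pts1)
r2 = price_out([-4, -3], [[2, 0], [1, 1]], [0, 0], 0, pts2)
```

With all multipliers at zero the second division reports the smaller value, so in Step 3 its extreme point would supply the column that enters the master basis.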

This algorithm has an interesting economic interpretation in the context of a 
multidivisional firm minimizing its total cost of operations as described earlier. 
Division i's activities are internally constrained by $A_i x_i = b_i$, and the common 
resources $b_0$ impose linking constraints. At Step 1 of the algorithm, the firm's central 
management formulates its current master plan, which is perhaps suboptimal, and 
announces a new set of prices that each division must use to revise its recommended 
strategy at Step 2. In particular, $-\lambda_0$ reflects the new prices that higher management 
has placed on the common resources. The division that reports the greatest rate of 
potential cost improvement has its recommendations incorporated in the new master 
plan at Step 3, and the process is repeated. If no cost improvement is possible, 
central management settles on the current master plan. 


Example 2. Consider the problem 


    minimize    $-x_1 - 2x_2 - 4y_1 - 3y_2$
    subject to  $x_1 + x_2 + 2y_1 \le 4$
                $x_2 + y_1 + y_2 \le 3$
                $2x_1 + x_2 \le 4$
                $x_1 + x_2 \le 2$
                $y_1 + y_2 \le 2$
                $3y_1 + 2y_2 \le 5$
                $x_1 \ge 0, \; x_2 \ge 0, \; y_1 \ge 0, \; y_2 \ge 0.$


The decomposition algorithm can be applied by introducing slack variables and
identifying the first two constraints as linking constraints. Rather than using double
subscripts, the primary variables of the subsystems are taken to be $x = (x_1, x_2)$,
$y = (y_1, y_2)$.


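Initialization. Any vector $(x, y)$ of the master problem must be of the form

$$x = \sum_{i=1}^{I} \alpha_i x_i, \qquad y = \sum_{j=1}^{J} \beta_j y_j,$$

where $x_i$ and $y_j$ are extreme points of the subsystems, and

$$\sum_{i=1}^{I} \alpha_i = 1, \quad \sum_{j=1}^{J} \beta_j = 1, \quad \alpha_i \ge 0, \quad \beta_j \ge 0.$$

Therefore the master problem is

    minimize    $\sum_{i=1}^{I} p_i \alpha_i + \sum_{j=1}^{J} t_j \beta_j$
    subject to  $\sum_{i=1}^{I} \alpha_i L_1 x_i + \sum_{j=1}^{J} \beta_j L_2 y_j + s = b$
                $\sum_{i=1}^{I} \alpha_i = 1, \quad \alpha_i \ge 0, \quad i = 1, 2, \ldots, I$
                $\sum_{j=1}^{J} \beta_j = 1, \quad \beta_j \ge 0, \quad j = 1, 2, \ldots, J,$

where $p_i$ is the cost of $x_i$, $t_j$ is the cost of $y_j$, and where $s = (s_1, s_2)$ is a vector of
slack variables for the linking constraints. This problem corresponds to (47).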

A starting basic feasible solution is $s = b$, $\alpha_1 = 1$, $\beta_1 = 1$, where $x_1 = 0$, $y_1 = 0$
are extreme points of the subsystems. The corresponding starting basis is $B = I$
and, accordingly, the initial tableau for the revised simplex method for the master
problem is

    Variable         B^{-1}          Value
    s_1          1    0    0    0      4
    s_2          0    1    0    0      3
    alpha_1      0    0    1    0      1
    beta_1       0    0    0    1      1

Then $\lambda^T = [0, 0, 0, 0]\, B^{-1} = [0, 0, 0, 0]$.


Iteration 1. The relative cost coefficients are found by solving the subproblems
defined by (50). The first is

    minimize    $-x_1 - 2x_2$
    subject to  $2x_1 + x_2 \le 4$
                $x_1 + x_2 \le 2$
                $x_1 \ge 0, \; x_2 \ge 0.$

This problem can be solved easily (by the simplex method or by inspection). The
solution is $x = (0, 2)$, with $r_1 = -4$.

The second subsystem is solved correspondingly. The solution is $y = (1, 1)$
with $r_2 = -7$.

It follows from Step 2 of the general algorithm that $r^* = -7$. We let $y_2 = (1, 1)$
and bring $\beta_2$ into the basis of the master problem.
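The pricing step of this iteration can be reproduced in a few lines. The sketch below is illustrative only: rather than solving each subproblem (50) by the simplex method, it enumerates the extreme points of the two subsystems, which were worked out by hand for this small example and are an assumption of the sketch. The helper name `price_out` and the matrices `L1`, `L2` (read off from the two linking constraints) are likewise not part of the text.

```python
from fractions import Fraction as F

def price_out(c, L, lam0, lam_conv, extreme_points):
    """Evaluate (51) for one subsystem by enumerating its extreme points.

    c: subsystem cost vector, L: linking matrix L_i,
    lam0: multipliers of the linking rows, lam_conv: multiplier lambda_{m+i}.
    Returns (r_i*, minimizing extreme point).
    """
    n = len(c)
    # Adjusted cost coefficients c_i^T - lam0^T L_i.
    adj = [c[k] - sum(lam0[r] * L[r][k] for r in range(len(lam0)))
           for k in range(n)]
    value = lambda x: sum(a * xk for a, xk in zip(adj, x))
    best = min(extreme_points, key=value)
    return value(best) - lam_conv, best

# Multipliers at the start of Iteration 1.
lam = [F(0), F(0), F(0), F(0)]
lam0 = lam[:2]

# Subsystem 1: 2x1 + x2 <= 4, x1 + x2 <= 2, x >= 0 (hand-enumerated vertices).
c1 = [F(-1), F(-2)]
L1 = [[F(1), F(1)], [F(0), F(1)]]
S1 = [(F(0), F(0)), (F(2), F(0)), (F(0), F(2))]

# Subsystem 2: y1 + y2 <= 2, 3y1 + 2y2 <= 5, y >= 0 (hand-enumerated vertices).
c2 = [F(-4), F(-3)]
L2 = [[F(2), F(0)], [F(1), F(1)]]
S2 = [(F(0), F(0)), (F(5, 3), F(0)), (F(1), F(1)), (F(0), F(2))]

r1, x_star = price_out(c1, L1, lam0, lam[2], S1)
r2, y_star = price_out(c2, L2, lam0, lam[3], S2)
print(r1, x_star, r2, y_star)
```

With zero multipliers the adjusted costs are just the original costs, so the second subsystem gives the more negative value and supplies the entering column. The same function can be called with the multipliers of later iterations.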


Master Iteration. The new column to enter the basis is

$$\begin{bmatrix} L_2 y_2 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \\ 0 \\ 1 \end{bmatrix},$$

and since the current basis is $B = I$, the new tableau is (pivot element in parentheses)

    Variable         B^{-1}          Value    New column
    s_1          1    0    0    0      4          2
    s_2          0    1    0    0      3          2
    alpha_1      0    0    1    0      1          0
    beta_1       0    0    0    1      1         (1)

which after pivoting leads to

    Variable         B^{-1}          Value
    s_1          1    0    0   -2      2
    s_2          0    1    0   -2      1
    alpha_1      0    0    1    0      1
    beta_2       0    0    0    1      1

Since $t_2 = c_2^T y_2 = -7$, we find

$$\lambda^T = [0, 0, 0, -7]\, B^{-1} = [0, 0, 0, -7].$$


Iteration 2. Since $\lambda_0$, which comprises the first two components of $\lambda$, has not
changed, the subproblems remain the same, but now according to (51), $r^* = -4$
and $\alpha_2$ should be brought into the basis, where $x_2 = (0, 2)$.

Master Iteration. The new column to enter the basis is

$$\begin{bmatrix} L_1 x_2 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \\ 1 \\ 0 \end{bmatrix}.$$

This must be multiplied by $B^{-1}$ to obtain its representation in terms of the current
basis (but the representation does not change it in this case). The master tableau is
then updated as follows (pivot element in parentheses):

    Variable         B^{-1}            Value    New column
    s_1          1    0     0   -2       2          2
    s_2          0    1     0   -2       1         (2)
    alpha_1      0    0     1    0       1          1
    beta_2       0    0     0    1       1          0

    Variable         B^{-1}            Value
    s_1          1   -1     0    0       1
    alpha_2      0    1/2   0   -1       1/2
    alpha_1      0   -1/2   1    1       1/2
    beta_2       0    0     0    1       1

Since $p_2 = -4$, we have

$$\lambda^T = [0, -4, 0, -7]\, B^{-1} = [0, -2, 0, -3].$$
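The tableau update and the new multipliers in this master iteration can be checked mechanically. The following is a minimal sketch, using exact fractions, of one revised-simplex pivot on the array $[B^{-1} \mid \text{value}]$ followed by the computation $\lambda^T = c_B^T B^{-1}$. The function names `pivot` and `multipliers` are illustrative choices, not notation from the text.

```python
from fractions import Fraction as F

def pivot(Binv, values, column, row):
    """Pivot the array [Binv | values] on column[row]; returns new copies."""
    p = column[row]
    new_B = [r[:] for r in Binv]
    new_v = values[:]
    # Scale the pivot row so the entering column becomes 1 in that row.
    new_B[row] = [e / p for e in Binv[row]]
    new_v[row] = values[row] / p
    # Eliminate the entering column from every other row.
    for i in range(len(Binv)):
        if i != row:
            f = column[i]
            new_B[i] = [a - f * b for a, b in zip(Binv[i], new_B[row])]
            new_v[i] = values[i] - f * new_v[row]
    return new_B, new_v

def multipliers(cB, Binv):
    """Return lambda^T = c_B^T B^{-1} as a list."""
    m = len(Binv)
    return [sum(cB[i] * Binv[i][j] for i in range(m)) for j in range(m)]

# State after the first master iteration: basis (s1, s2, alpha1, beta2).
Binv = [[F(1), F(0), F(0), F(-2)],
        [F(0), F(1), F(0), F(-2)],
        [F(0), F(0), F(1), F(0)],
        [F(0), F(0), F(0), F(1)]]
values = [F(2), F(1), F(1), F(1)]

# Entering column for alpha2, built from L1 x2 = (2, 2) and the convexity rows.
col = [F(2), F(2), F(1), F(0)]
# Its representation in the current basis is B^{-1} times the column.
y = [sum(Binv[i][j] * col[j] for j in range(4)) for i in range(4)]

# The ratio test selects the second row (s2 leaves the basis).
Binv2, values2 = pivot(Binv, values, y, row=1)

# Costs of the new basic variables (s1, alpha2, alpha1, beta2).
lam = multipliers([F(0), F(-4), F(0), F(-7)], Binv2)
print(values2, lam)
```

The computed `values2` and `lam` agree with the second tableau and with the multipliers stated above.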


Iteration 3. The subsystem's problems are now

    minimize    $-x_1$                          minimize    $-2y_1 - y_2 + 3$
    subject to  $2x_1 + x_2 \le 4$              subject to  $y_1 + y_2 \le 2$
                $x_1 + x_2 \le 2$                           $3y_1 + 2y_2 \le 5$
                $x_1 \ge 0, \; x_2 \ge 0$                   $y_1 \ge 0, \; y_2 \ge 0.$

It follows that $x_3 = (2, 0)$ and $\alpha_3$ should be brought into the basis.

Master Iteration. Proceeding as usual, we obtain the new tableau and new $\lambda$ as
follows (pivot element in parentheses).

    Variable         B^{-1}            Value    New column
    s_1          1   -1     0    0       1          2
    alpha_2      0    1/2   0   -1       1/2        0
    alpha_1      0   -1/2   1    1       1/2       (1)
    beta_2       0    0     0    1       1          0

    Variable         B^{-1}            Value
    s_1          1    0    -2   -2       0
    alpha_2      0    1/2   0   -1       1/2
    alpha_3      0   -1/2   1    1       1/2
    beta_2       0    0     0    1       1

$$\lambda^T = [0, -4, -2, -7]\, B^{-1} = [0, -1, -2, -5].$$

The subproblems now have objectives $-x_1 - x_2 + 2$ and $-3y_1 - 2y_2 + 5$, respectively,
which both have minimum values of zero. Thus the current solution is optimal. The
solution is $x = \tfrac{1}{2} x_2 + \tfrac{1}{2} x_3$, $y = y_2$ (in terms of the extreme points), or equivalently,
in the original coordinates, $x_1 = 1$, $x_2 = 1$, $y_1 = 1$, $y_2 = 1$.
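As a sanity check on the final answer, the convex combination of extreme points can be expanded and tested against the original constraints. The sketch below assumes the objective coefficients $(-1, -2, -4, -3)$ implied by the subproblem costs used in the example. It also compares the cost with the value $\lambda^T(b, 1, 1)$ of the final multipliers, which should agree at an optimum of the master problem.

```python
from fractions import Fraction as F

# Extreme points generated during the iterations and their final weights.
x2, x3, y2 = (F(0), F(2)), (F(2), F(0)), (F(1), F(1))
alpha2 = alpha3 = F(1, 2)          # beta2 = 1, all other weights are zero

# Recover the solution in the original coordinates.
x = tuple(alpha2 * a + alpha3 * b for a, b in zip(x2, x3))
y = y2

cost = -x[0] - 2 * x[1] - 4 * y[0] - 3 * y[1]

# The two linking constraints followed by the four subsystem constraints.
constraints = [
    x[0] + x[1] + 2 * y[0] <= 4,
    x[1] + y[0] + y[1] <= 3,
    2 * x[0] + x[1] <= 4,
    x[0] + x[1] <= 2,
    y[0] + y[1] <= 2,
    3 * y[0] + 2 * y[1] <= 5,
]

# Value of the final multipliers against the master right-hand side (b, 1, 1).
lam = [F(0), F(-1), F(-2), F(-5)]
rhs = [F(4), F(3), F(1), F(1)]
dual_value = sum(l * b for l, b in zip(lam, rhs))

print(x, y, cost, dual_value, all(constraints))
```

Both the cost and the multiplier value come out to $-10$, and every constraint is satisfied.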


3.10 SUMMARY 


The simplex method is founded on the fact that the optimal value of a linear program, 
if finite, is always attained at a basic feasible solution. Using this foundation there 
are two ways in which to visualize the simplex process. The first is to view the 
process as one of continuous change. One starts with a basic feasible solution 
and imagines that some nonbasic variable is increased slowly from zero. As the 
value of this variable is increased, the values of the current basic variables are 
continuously adjusted so that the overall vector continues to satisfy the system of 
linear equality constraints. The change in the objective function due to a unit change 
in this nonbasic variable, taking into account the corresponding required changes 
in the values of the basic variables, is the relative cost coefficient associated with 
the nonbasic variable. If this coefficient is negative, then the objective value will 
be continuously improved as the value of this nonbasic variable is increased, and 
therefore one increases the variable as far as possible, to the point where further 
increase would violate feasibility. At this point the value of one of the basic variables 
is zero, and that variable is declared nonbasic, while the nonbasic variable that was 
increased is declared basic. 
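The phrase "as far as possible" in this continuous viewpoint is exactly the ratio test. A minimal sketch follows, with illustrative data and a hypothetical function name `ratio_test`: the entering variable can grow until the first basic variable reaches zero, and only rows with a positive entry in the entering column limit the increase.

```python
def ratio_test(values, column):
    """Return (step, leaving_row) for an entering column, or None if unbounded.

    values: current values of the basic variables,
    column: the entering column expressed in the current basis.
    """
    # A basic variable decreases only where the column entry is positive.
    candidates = [(v / y, i)
                  for i, (v, y) in enumerate(zip(values, column)) if y > 0]
    if not candidates:
        return None          # the variable can be increased without limit
    # Ties are broken toward the lowest row index by tuple comparison.
    return min(candidates)

print(ratio_test([4, 3, 1, 1], [2, 2, 0, 1]))
print(ratio_test([1, 2], [-1, 0]))
```

In the first call the last row limits the increase to one unit, so that basic variable leaves. In the second call no row limits the increase.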

The other viewpoint is more discrete in nature. Realizing that only basic 
feasible solutions need be considered, various bases are selected and the corre- 
sponding basic solutions are calculated by solving the associated set of linear 
equations. The logic for the systematic selection of new bases again involves the 
relative cost coefficients and, of course, is derived largely from the first, continuous, 
viewpoint. 


3.11 EXERCISES 


1. Using pivoting, solve the simultaneous equations 


    $3x_1 + 2x_2 = 5$
    $5x_1 + x_2 = 9.$


2. Using pivoting, solve the simultaneous equations 


    $x_1 + 2x_2 + x_3 = 7$
    $2x_1 - x_2 + 2x_3 = 6$
    $x_1 + x_2 + 3x_3 = 12.$


3. Solve the equations in Exercise 2 by Gaussian elimination as described in Appendix C.


4. Suppose $B$ is an $m \times m$ square nonsingular matrix, and let the tableau $T$ be
constructed, $T = [I, B]$ where $I$ is the $m \times m$ identity matrix. Suppose that pivot
operations are performed on this tableau so that it takes the form $[C, I]$. Show that
$C = B^{-1}$.


5. Show that if the vectors $a_1, a_2, \ldots, a_m$ are a basis in $E^m$, the vectors
$a_1, a_2, \ldots, a_{p-1}, a_q, a_{p+1}, \ldots, a_m$ also are a basis if and only if $y_{pq} \ne 0$, where $y_{pq}$ is
defined by the tableau (7).


6. If $r_j > 0$ for every $j$ corresponding to a variable $x_j$ that is not basic, show that the
corresponding basic feasible solution is the unique optimal solution.


7. Show that a degenerate basic feasible solution may be optimal without satisfying $r_j \ge 0$
for all $j$.


8. a) Using the simplex procedure, solve 


    maximize    $-x_1 + x_2$
    subject to  $x_1 - x_2 \le 2$
                $x_1 + x_2 \le 6$
                $x_1 \ge 0, \; x_2 \ge 0.$


b) Draw a graphical representation of the problem in x,, x, space and indicate the path 
of the simplex steps. 
c) Repeat for the problem 


    maximize    $x_1 + x_2$
    subject to  $-2x_1 + x_2 \le 1$
x 


9. Using the simplex procedure, solve the spare-parts manufacturer’s problem (Exercise 4, 
Chapter 2). 


10. Using the simplex procedure, solve 


    minimize    $2x_1 + 4x_2 + x_3 + x_4$
    subject to  $x_1 + 3x_2 + x_4 \le 4$
                $2x_1 + x_2 \le 3$
                $x_2 + 4x_3 + x_4 \le 3$
                $x_i \ge 0, \quad i = 1, 2, 3, 4.$


11. For the linear program of Exercise 10 


a) How much can the elements of $b = (4, 3, 3)$ be changed without changing the
optimal basis?

b) How much can the elements of $c = (2, 4, 1, 1)$ be changed without changing the
optimal basis?

c) What happens to the optimal cost for small changes in $b$?

d) What happens to the optimal cost for small changes in $c$?


12. Consider the problem

    minimize    $x_1 - 3x_2 - 0.4x_3$
    subject to  $3x_1 - x_2 + 2x_3 \le 7$
                $-2x_1 + 4x_2 \le 2$
                $-4x_1 + 3x_2 + 3x_3 \le 14$
                $x_1 \ge 0, \; x_2 \ge 0, \; x_3 \ge 0.$

a) Find an optimal solution.

b) How many optimal basic feasible solutions are there?

c) Show that if $c_4 + \tfrac{1}{5} a_{14} + \tfrac{4}{5} a_{24} \ge 0$, then another activity $x_4$ can be introduced with
cost coefficient $c_4$ and activity vector $(a_{14}, a_{24}, a_{34})$ without changing the optimal
solution.


13. Rather than select the variable corresponding to the most negative relative cost coefficient
as the variable to enter the basis, it has been suggested that a better criterion would be
to select that variable which, when pivoted in, will produce the greatest improvement
in the objective function. Show that this criterion leads to selecting the variable $x_k$
corresponding to the index $k$ minimizing $\max_{i,\, y_{ik} > 0} \; r_k y_{i0} / y_{ik}$.


14. In the ordinary simplex method one new vector is brought into the basis and one removed
at every step. Consider the possibility of bringing two new vectors into the basis and
removing two at each stage. Develop a complete procedure that operates in this fashion.


15. Degeneracy. If a basic feasible solution is degenerate, it is then theoretically possible
that a sequence of degenerate basic feasible solutions will be generated that endlessly
cycles without making progress. It is the purpose of this exercise and the next two to
develop a technique that can be applied to the simplex method to avoid this cycling.

Corresponding to the linear system $Ax = b$ where $A = [a_1, a_2, \ldots, a_n]$ define the
perturbed system $Ax = b(\varepsilon)$ where $b(\varepsilon) = b + \varepsilon a_1 + \varepsilon^2 a_2 + \cdots + \varepsilon^n a_n$, $\varepsilon > 0$. Show that
if there is a basic feasible solution (possibly degenerate) to the unperturbed system with
basis $B = [a_1, a_2, \ldots, a_m]$, then corresponding to the same basis, there is a nondegenerate
basic feasible solution to the perturbed system for some range of $\varepsilon > 0$.


16. Show that corresponding to any basic feasible solution to the perturbed system of
Exercise 15, which is nondegenerate for some range of $\varepsilon > 0$, and to a vector $a_k$ not in
the basis, there is a unique vector $a_j$ in the basis which when replaced by $a_k$ leads to a
basic feasible solution; and that solution is nondegenerate for a range of $\varepsilon > 0$.


17. Show that the tableau associated with a basic feasible solution of the perturbed system
of Exercise 15, and which is nondegenerate for a range of $\varepsilon > 0$, is identical with that of
the unperturbed system except in the column under $b(\varepsilon)$. Show how the proper pivot in
a given column to preserve feasibility of the perturbed system can be determined from
the tableau of the unperturbed system. Conclude that the simplex method will avoid
cycling if whenever there is a choice in the pivot element of a column $k$, arising from a
tie in the minimum of $y_{i0}/y_{ik}$ among the elements $i \in I_0$, the tie is resolved by finding
the minimum of $y_{i1}/y_{ik}$, $i \in I_0$. If there still remain ties among elements $i \in I_1$, the process
is repeated with $y_{i2}/y_{ik}$, etc., until there is a unique element.


18. Using the two-phase simplex procedure solve 


a)  minimize    $-3x_1 + x_2 + 3x_3 - x_4$
    subject to  $x_1 + 2x_2 - x_3 + x_4 = 0$
                $2x_1 - 2x_2 + 3x_3 + 3x_4 = 9$
                $x_1 - x_2 + 2x_3 - x_4 = 6$
                $x_i \ge 0, \quad i = 1, 2, 3, 4.$


b)  minimize    $x_1 + 6x_2 - 7x_3 + x_4 + 5x_5$
    subject to  $5x_1 - 4x_2 + 13x_3 - 2x_4 + x_5 = 20$
                $x_1 - x_2 + 5x_3 - x_4 + x_5 = 8$
                $x_i \ge 0, \quad i = 1, 2, 3, 4, 5.$


19. Solve the oil refinery problem (Exercise 3, Chapter 2). 


20. Show that in the phase I procedure of a problem that has feasible solutions, if an artificial 
variable becomes nonbasic, it need never again be made basic. Thus, when an artificial 
variable becomes nonbasic its column can be eliminated from future tableaus. 


21. Suppose the phase I procedure is applied to the system $Ax = b$, $x \ge 0$, and that the
resulting tableau (ignoring the cost row) has the form

    x_1 ... x_k    x_{k+1} ... x_n    y_1 ... y_k    y_{k+1} ... y_m       b
         I               R_1              S_1               0           \bar{b}
         0               R_2              S_2               I              0

where $\bar{b} = (\bar{b}_1, \ldots, \bar{b}_k)$. This corresponds to having $m - k$ basic artificial variables at zero level.


a) Show that any nonzero element in $R_2$ can be used as a pivot to eliminate a basic
artificial variable, thus yielding a similar tableau but with $k$ increased by one.

b) Suppose that the process in (a) has been repeated to the point where $R_2 = 0$. Show that
the original system is redundant, and show how phase II may proceed by eliminating
the bottom rows.

c) Use the above method to solve the linear program

    minimize    $2x_1 + 6x_2 + x_3 + x_4$
    subject to  $x_1 + 2x_2 + x_4 = 6$
                $x_1 + 2x_2 + x_3 + x_4 = 7$
                $x_1 + 3x_2 - x_3 + 2x_4 = 7$
                $x_1 + x_2 + x_3 = 5$
                $x_1 \ge 0, \; x_2 \ge 0, \; x_3 \ge 0, \; x_4 \ge 0.$


22. Find a basic feasible solution to

    $x_1 + 2x_2 - x_3 + x_4 = 3$
    $2x_1 + 4x_2 + x_3 + 2x_4 = 12$
    $x_1 + 4x_2 + 2x_3 + x_4 = 9$
    $x_i \ge 0, \quad i = 1, 2, 3, 4.$


23. Consider the system of linear inequalities $Ax \ge b$, $x \ge 0$ with $b \ge 0$. This system can
be transformed to standard form by the introduction of $m$ surplus variables so that it
becomes $Ax - y = b$, $x \ge 0$, $y \ge 0$. Let $b_k = \max_i b_i$ and consider the new system in
standard form obtained by adding the $k$th row to the negative of every other row. Show
that the new system requires the addition of only a single artificial variable to obtain an
initial basic feasible solution.

Use this technique to find a basic feasible solution to the system

    $x_1 + 2x_2 + x_3 \ge 4$
    $2x_1 + x_2 + x_3 \ge 5$
    $2x_1 + 3x_2 + 2x_3 \ge 6$
    $x_i \ge 0, \quad i = 1, 2, 3.$


24. It is possible to combine the two phases of the two-phase method into a single procedure
by the big-M method. Given the linear program in standard form

    minimize    $c^T x$
    subject to  $Ax = b$
                $x \ge 0,$

one forms the approximating problem

    minimize    $c^T x + M \sum_{i=1}^{m} y_i$
    subject to  $Ax + y = b$
                $x \ge 0$
                $y \ge 0.$

In this problem $y = (y_1, y_2, \ldots, y_m)$ is a vector of artificial variables and $M$ is a large
constant. The term $M \sum_{i=1}^{m} y_i$ serves as a penalty term for nonzero $y_i$'s.

If this problem is solved by the simplex method, show the following:

a) If an optimal solution is found with $y = 0$, then the corresponding $x$ is an optimal
basic feasible solution to the original problem.

b) If for every $M > 0$ an optimal solution is found with $y \ne 0$, then the original problem
is infeasible.

c) If for every $M > 0$ the approximating problem is unbounded, then the original problem
is either unbounded or infeasible.

d) Suppose now that the original problem has a finite optimal value $V(\infty)$. Let $V(M)$
be the optimal value of the approximating problem. Show that $V(M) \le V(\infty)$.

e) Show that for $M_1 \le M_2$ we have $V(M_1) \le V(M_2)$.
f) Show that there is a value $M_0$ such that for $M \ge M_0$, $V(M) = V(\infty)$, and hence
conclude that the big-M method will produce the right solution for large enough
values of $M$.


25. A certain telephone company would like to determine the maximum number of
long-distance calls from Westburgh to Eastville that it can handle at any one time. The
company has cables linking these cities via several intermediary cities as follows:

[Figure: directed cable network from Westburgh to Eastville through the intermediary cities Northgate, Northbay, Southgate, and Southbay. The arrow directions and cable capacities appear only in the original figure.]

Each cable can handle a maximum number of calls simultaneously as indicated in 
the figure. For example, the number of calls routed from Westburgh to Northgate 
cannot exceed five at any one time. A call from Westburgh to Eastville can be routed 
through any other city, as long as there is a cable available that is not currently being 
used to its capacity. In addition to determining the maximum number of calls from 
Westburgh to Eastville, the company would, of course, like to know the optimal routing 
of these calls. Assume calls can be routed only in the directions indicated by the 
arrows. 


a) Formulate the above problem as a linear programming problem with upper bounds. 
(Hint: Denote by x;; the number of calls routed from city i to city j.) 
b) Find the solution by inspection of the graph. 


26. Using the revised simplex method find a basic feasible solution to 


    $x_1 + 2x_2 - x_3 + x_4 = 3$
    $2x_1 + 4x_2 + x_3 + 2x_4 = 12$
    $x_1 + 4x_2 + 2x_3 + x_4 = 9$
    $x_i \ge 0, \quad i = 1, 2, 3, 4.$


27. The following tableau is an intermediate stage in the solution of a minimization problem: 


            y_1     y_2     y_3     y_4     y_5     y_6     y_0
             1      2/3      0       0      4/3      0       4
             0     -7/3      3       1     -2/3      0       2
             0     -2/3     -2       0      2/3      1       2
    r^T      0      8/3    -11       0      4/3      0      -8


a) Determine the next pivot element.

b) Given that the inverse of the current basis is

$$B^{-1} = [a_1, a_4, a_6]^{-1} = \frac{1}{3} \begin{bmatrix} 1 & 1 & -1 \\ 1 & -2 & 2 \\ -1 & 2 & 1 \end{bmatrix}$$

and the corresponding cost coefficients are

$$c_B^T = (c_1, c_4, c_6) = (-1, -3, 1),$$

find the original problem.


28. In many applications of linear programming it may be sufficient, for practical purposes,
to obtain a solution for which the value of the objective function is within a predetermined
tolerance $\varepsilon$ from the minimum value $z^*$. Stopping the simplex algorithm at
such a solution rather than searching for the true minimum may considerably reduce the
computations.

a) Consider a linear programming problem for which the sum of the variables is known
to be bounded above by $s$. Let $z_0$ denote the current value of the objective function
at some stage of the simplex algorithm, $(c_j - z_j)$ the corresponding relative cost
coefficients, and

$$M = \max_j \, (z_j - c_j).$$

Show that if $M \le \varepsilon / s$, then $z_0 - z^* \le \varepsilon$.

b) Consider the transportation problem described in Section 2.2 (Example 2). Assuming
this problem is solved by the simplex method and it is sufficient to obtain a
solution within $\varepsilon$ tolerance from the optimal value of the objective function, specify
a stopping criterion for the algorithm in terms of $\varepsilon$ and the parameters of the
problem.


29. Work out an extension of LU decomposition, as described in Appendix C, when row
interchanges are introduced.


30. Work out the details of LU decomposition applied to the simplex method when row
interchanges are required.


31. Anticycling Rule. A remarkably simple procedure for avoiding cycling was developed
by Bland, and we discuss it here.

Bland's Rule. In the simplex method:

a) Select the column to enter the basis by $j = \min\{j : r_j < 0\}$; that is, select the
lowest-indexed favorable column.

b) In case ties occur in the criterion for determining which column is to leave the basis,
select the one with lowest index.

We can prove by contradiction that the use of Bland's rule prohibits cycling. Suppose
that cycling occurs. During the cycle a finite number of columns enter and leave the
basis. Each of these columns enters at level zero, and the cost function does not change.
Delete all rows and columns that do not contain pivots during a cycle, obtaining a new
linear program that also cycles. Assume that this reduced linear program has $m$ rows
and $n$ columns. Consider the solution stage where column $n$ is about to leave the basis,
being replaced by column $p$. The corresponding tableau is as follows (where the entries
shown are explained below):

             a_1   ...   a_p   ...   a_n       b
                         ≤ 0          0        0
                         ...         ...      ...
                         ≤ 0          0        0
                         > 0          1        0
    c^T                  < 0          0        0

Without loss of generality, we assume that the current basis consists of the last $m$
columns. In fact, we may define the reduced linear program in terms of this tableau,
calling the current coefficient array $A$ and the current relative cost vector $c$. In this
tableau we pivot on $a_{mp}$, so $a_{mp} > 0$. By Part b) of Bland's rule, $a_n$ can leave the basis
only if there are no ties in the ratio test, and since $b = 0$ because all rows are in the
cycle, it follows that $a_{ip} \le 0$ for all $i \ne m$.

Now consider the situation when column $n$ is about to reenter the basis. Part a)
of Bland's rule ensures that $r_n < 0$ and $r_i \ge 0$ for all $i \ne n$. Apply the formula
$r_i = c_i - \lambda^T a_i$ to the last $m$ columns to show that each component of $\lambda$ except $\lambda_m$ is
nonpositive; and $\lambda_m > 0$. Then use this to show that $r_p = c_p - \lambda^T a_p < c_p < 0$, contradicting
$r_p \ge 0$.


32. Use the Dantzig—Wolfe decomposition method to solve 
    minimize    $-4x_1 - x_2 - 3x_3 - 2x_4$
    subject to  $2x_1 + 2x_2 + x_3 + 2x_4 \le 6$
                $x_2 + 2x_3 + 3x_4 \le 4$
                $2x_1 + x_2 \le 5$
                $x_2 \le 1$
                $-x_3 + 2x_4 \le 2$
                $x_3 + 2x_4 \le 6$
                $x_1 \ge 0, \; x_2 \ge 0, \; x_3 \ge 0, \; x_4 \ge 0.$


Chapter 4 DUALITY


Associated with every linear program, and intimately related to it, is a corresponding 
dual linear program. Both programs are constructed from the same underlying cost 
and constraint coefficients but in such a way that if one of these problems is one 
of minimization the other is one of maximization, and the optimal values of the 
corresponding objective functions, if finite, are equal. The variables of the dual 
problem can be interpreted as prices associated with the constraints of the original 
(primal) problem, and through this association it is possible to give an economically 
meaningful characterization to the dual whenever there is such a characterization 
for the primal. 

The variables of the dual problem are also intimately related to the calcu- 
lation of the relative cost coefficients in the simplex method. Thus, a study of 
duality sharpens our understanding of the simplex procedure and motivates certain 
alternative solution methods. Indeed, the simultaneous consideration of a problem 
from both the primal and dual viewpoints often provides significant computational 
advantage as well as economic insight. 


4.1 DUAL LINEAR PROGRAMS 


In this section we define the dual program that is associated with a given linear 
program. Initially, we depart from our usual strategy of considering programs 
in standard form, since the duality relationship is most symmetric for programs 
expressed solely in terms of inequalities. Specifically then, we define duality through 
the pair of programs displayed below. 


    Primal                           Dual
    minimize    $c^T x$              maximize    $\lambda^T b$                 (1)
    subject to  $Ax \ge b$           subject to  $\lambda^T A \le c^T$
                $x \ge 0$                        $\lambda \ge 0$


If $A$ is an $m \times n$ matrix, then $x$ is an $n$-dimensional column vector, $b$ is an
$m$-dimensional column vector, $c^T$ is an $n$-dimensional row vector, and $\lambda^T$ is an
$m$-dimensional row vector. The vector $x$ is the variable of the primal program, and
$\lambda$ is the variable of the dual program.

The pair of programs (1) is called the symmetric form of duality and, as
explained below, can be used to define the dual of any linear program. It is important
to note that the role of primal and dual can be reversed. The dual is obtained from
the primal by four steps: interchange of cost and constraint vectors, transposition of
the coefficient matrix, reversal of constraint inequalities, and change of minimization
to maximization. Studying this process in detail, we see that the same
process applied to the dual yields the primal. Put another way, if the dual is
transformed, by multiplying the objective and the constraints by minus unity, so
that it has the structure of the primal (but is still expressed in terms of $\lambda$), its
corresponding dual will be equivalent to the original primal.
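The reversibility of the primal and dual roles can be illustrated with plain data manipulation. In the sketch below a program in the primal form of (1) is stored as a triple (cost, matrix, right-hand side). The function names `dual_data` and `as_primal` and the numerical data are illustrative assumptions, not part of the text.

```python
def negate(v):
    return [-e for e in v]

def negate_matrix(M):
    return [[-e for e in row] for row in M]

def dual_data(c, A, b):
    """Given min c^T x, Ax >= b, x >= 0, return the data (b, A^T, c) of
    max b^T z, A^T z <= c, z >= 0 (the dual, written with z as a column)."""
    return list(b), [list(col) for col in zip(*A)], list(c)

def as_primal(obj, M, rhs):
    """Rewrite max obj^T z, M z <= rhs, z >= 0 in the primal structure
    min (-obj)^T z, (-M) z >= -rhs, z >= 0 (multiply through by -1)."""
    return negate(obj), negate_matrix(M), negate(rhs)

# Illustrative primal data.
c = [2, 3, 1]
A = [[1, 0, 2],
     [0, 1, 1]]
b = [4, 5]

# Dual of the primal, then put into primal structure.
d_obj, d_M, d_rhs = dual_data(c, A, b)
p_c, p_A, p_b = as_primal(d_obj, d_M, d_rhs)

# Dual of that program, again put into primal structure.
dd_obj, dd_M, dd_rhs = dual_data(p_c, p_A, p_b)
back = as_primal(dd_obj, dd_M, dd_rhs)

print(back == (c, A, b))
```

Taking the dual twice, with the sign changes described in the text in between, returns the original triple.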

The dual of any linear program can be found by converting the program to 
the form of the primal shown above. For example, given a linear program in 
standard form 


    minimize    $c^T x$
    subject to  $Ax = b$
                $x \ge 0,$

we write it in the equivalent form

    minimize    $c^T x$
    subject to  $Ax \ge b$
                $-Ax \ge -b$
                $x \ge 0,$

which is in the form of the primal of (1) but with coefficient matrix
$\begin{bmatrix} A \\ -A \end{bmatrix}$. Using a dual vector partitioned as $(u, v)$, the corresponding dual is

    maximize    $u^T b - v^T b$
    subject to  $u^T A - v^T A \le c^T$
                $u \ge 0$
                $v \ge 0.$

Letting $\lambda = u - v$ we may simplify the representation of the dual program so that
we obtain the pair of problems displayed below:

    Primal                           Dual
    minimize    $c^T x$              maximize    $\lambda^T b$                 (2)
    subject to  $Ax = b$             subject to  $\lambda^T A \le c^T$.
                $x \ge 0$

This is the asymmetric form of the duality relation. In this form the dual vector $\lambda$
(which is really a composite of $u$ and $v$) is not restricted to be nonnegative.

Similar transformations can be worked out for any linear program to first get
the primal in the form (1), calculate the dual, and then simplify the dual to account
for special structure.

In general, if some of the linear inequalities in the primal (1) are changed to
equality, the corresponding components of $\lambda$ in the dual become free variables. If
some of the components of $x$ in the primal are free variables, then the corresponding
inequalities in $\lambda^T A \le c^T$ are changed to equality in the dual. We mention again that
these are not arbitrary rules but are direct consequences of the original definition
and the equivalence of various forms of linear programs.
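The first of these rules is exactly what the substitution $\lambda = u - v$ does in the standard-form case. A short worked derivation, using only quantities already defined in this section:

```latex
\begin{aligned}
u^T b - v^T b &= (u - v)^T b = \lambda^T b, \\
u^T A - v^T A &= (u - v)^T A = \lambda^T A \le c^T .
\end{aligned}
```

Conversely, any $\lambda \in E^m$ can be written as $u - v$ with $u \ge 0$, $v \ge 0$ (take the positive and negative parts of each component), so the sign constraints on $u$ and $v$ place no restriction on $\lambda$. This is why an equality constraint in the primal produces a free dual variable.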


Example 1 (Dual of the diet problem). The diet problem, Example 1, Section 2.2, 
was the problem faced by a dietician trying to select a combination of foods to 
meet certain nutritional requirements at minimum cost. This problem has the form 


    minimize    $c^T x$
    subject to  $Ax \ge b$
                $x \ge 0$

and hence can be regarded as the primal program of the symmetric pair above. We
describe an interpretation of the dual problem.

Imagine a pharmaceutical company that produces in pill form each of the
nutrients considered important by the dietician. The pharmaceutical company tries
to convince the dietician to buy pills, and thereby supply the nutrients directly rather
than through purchase of various foods. The problem faced by the drug company
is that of determining positive unit prices $\lambda_1, \lambda_2, \ldots, \lambda_m$ for the nutrients so as to
maximize revenue while at the same time being competitive with real food. To be
competitive with real food, the cost of a unit of food $i$ made synthetically from pure
nutrients bought from the druggist must be no greater than $c_i$, the market price of
the food. Thus, denoting by $a_i$ the $i$th food, the company must satisfy $\lambda^T a_i \le c_i$
for each $i$. In matrix form this is equivalent to $\lambda^T A \le c^T$. Since $b_j$ units of the $j$th
nutrient will be purchased, the problem of the druggist is

    maximize    $\lambda^T b$
    subject to  $\lambda^T A \le c^T$
                $\lambda \ge 0,$

which is the dual problem. 


Example 2 (Dual of the transportation problem). The transportation problem, 
Example 2, Section 2.2, is the problem, faced by a manufacturer, of selecting the 
pattern of product shipments between several fixed origins and destinations so as 
to minimize transportation cost while satisfying demand. Referring to (6) and (7) 
of Chapter 2, the problem is in standard form, and hence the asymmetric version of 
the duality relation applies. There is a dual variable for each constraint. In this case
we denote the variables $u_i$, $i = 1, 2, \ldots, m$ for (6) and $v_j$, $j = 1, 2, \ldots, n$ for (7).
Accordingly, the dual is

    maximize    $\sum_{i=1}^{m} a_i u_i + \sum_{j=1}^{n} b_j v_j$
    subject to  $u_i + v_j \le c_{ij}, \quad i = 1, 2, \ldots, m; \; j = 1, 2, \ldots, n.$


To interpret the dual problem, we imagine an entrepreneur who, feeling that 
he can ship more efficiently, comes to the manufacturer with the offer to buy his 
product at the plant sites (origins) and sell it at the warehouses (destinations). The 
product price that is to be used in these transactions varies from point to point, 
and is determined by the entrepreneur in advance. He must choose these prices, of 
course, so that his offer will be attractive to the manufacturer. 

The entrepreneur, then, must select prices $-u_1, -u_2, \ldots, -u_m$ for the $m$ origins
and $v_1, v_2, \ldots, v_n$ for the $n$ destinations. To be competitive with usual transportation
modes, his prices must satisfy $u_i + v_j \le c_{ij}$ for all $i, j$, since $u_i + v_j$ represents the
net amount the manufacturer must pay to sell a unit of product at origin $i$ and buy
it back again at destination $j$. Subject to this constraint, the entrepreneur will adjust
his prices to maximize his revenue. Thus, his problem is as given above.
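A small numerical instance may make the entrepreneur's constraint concrete. The data below (two origins, three destinations, and the particular shipment plan and prices) are purely hypothetical and chosen for illustration. The sketch checks that the prices are competitive on every route and compares the entrepreneur's revenue with the cost of a feasible shipping plan.

```python
# Hypothetical data: supplies a_i, demands b_j, unit shipping costs c_ij.
supply = [30, 20]
demand = [10, 25, 15]
cost = [[4, 6, 9],
        [5, 3, 7]]

# A feasible shipping plan x_ij (rows sum to supplies, columns to demands).
x = [[10, 5, 15],
     [0, 20, 0]]

# Candidate dual variables: u_i for origins, v_j for destinations.
u = [0, -3]
v = [4, 6, 9]

def dual_feasible(u, v, cost):
    """Check u_i + v_j <= c_ij on every route."""
    return all(u[i] + v[j] <= cost[i][j]
               for i in range(len(u)) for j in range(len(v)))

plan_feasible = (
    all(sum(x[i]) == supply[i] for i in range(len(supply)))
    and all(sum(x[i][j] for i in range(len(supply))) == demand[j]
            for j in range(len(demand)))
)

shipping_cost = sum(cost[i][j] * x[i][j]
                    for i in range(len(supply)) for j in range(len(demand)))
dual_value = (sum(a * ui for a, ui in zip(supply, u))
              + sum(b * vj for b, vj in zip(demand, v)))

print(plan_feasible, dual_feasible(u, v, cost), shipping_cost, dual_value)
```

For this particular data the two values coincide at 265, so neither the manufacturer's plan nor the entrepreneur's prices can be improved. With other feasible prices the revenue would fall below the shipping cost.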


4.2 THE DUALITY THEOREM 


To this point the relation between the primal and dual programs has been simply a 
formal one based on what might appear as an arbitrary definition. In this section, 
however, the deeper connection between a program and its dual, as expressed by 
the Duality Theorem, is derived. 

The proof of the Duality Theorem given in this section relies on the Separating 
Hyperplane Theorem (Appendix B) and is therefore somewhat more advanced than 
previous arguments. It is given here so that the most general form of the Duality 
Theorem is established directly. An alternative approach is to use the theory of the 
simplex method to derive the duality result. A simplified version of this alternative 
approach is given in the next section. 

Throughout this section we consider the primal program in standard form 


    minimize    $c^T x$
    subject to  $Ax = b$                                   (3)
                $x \ge 0$

and its corresponding dual

    maximize    $\lambda^T b$                              (4)
    subject to  $\lambda^T A \le c^T$.

In this section it is not assumed that $A$ is necessarily of full rank. The following
lemma is easily established and gives us an important relation between the two
problems.

[Figure: the $z$ axis, with the dual values to the left and the primal values to the right.]

Fig. 4.1 Relation of primal and dual values


Lemma 1. (Weak Duality Lemma). If $x$ and $\lambda$ are feasible for (3) and (4),
respectively, then $c^T x \ge \lambda^T b$.


Proof. We have

$$\lambda^T b = \lambda^T A x \le c^T x,$$

the last inequality being valid since $x \ge 0$ and $\lambda^T A \le c^T$. ∎


This lemma shows that a feasible vector to either problem yields a bound on 
the value of the other problem. The values associated with the primal are all larger 
than the values associated with the dual as illustrated in Fig. 4.1. Since the primal 
seeks a minimum and the dual seeks a maximum, each seeks to reach the other. 
From this we have an important corollary. 


Corollary. If $x_0$ and $\lambda_0$ are feasible for (3) and (4), respectively, and if
$c^T x_0 = \lambda_0^T b$, then $x_0$ and $\lambda_0$ are optimal for their respective problems.


The above corollary shows that if a pair of feasible vectors can be found to the 
primal and dual programs with equal objective values, then these are both optimal. 
The Duality Theorem of linear programming states that the converse is also true, 
and that, in fact, the two regions in Fig. 4.1 actually have a common point; there is 
no “gap.” 
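The lemma and its corollary translate directly into a numerical certificate. The sketch below, on a small hypothetical standard-form problem, checks feasibility of a primal vector and a dual vector and returns the gap $c^T x - \lambda^T b$. By weak duality the gap is nonnegative, and a zero gap certifies that both vectors are optimal. The function name `duality_gap` and the data are illustrative assumptions.

```python
from fractions import Fraction as F

def duality_gap(c, A, b, x, lam):
    """Return c^T x - lam^T b for a feasible pair of (3) and (4).

    Raises ValueError if x is not primal feasible (Ax = b, x >= 0)
    or lam is not dual feasible (lam^T A <= c^T).
    """
    m, n = len(A), len(c)
    primal_ok = all(xj >= 0 for xj in x) and all(
        sum(A[i][j] * x[j] for j in range(n)) == b[i] for i in range(m))
    dual_ok = all(
        sum(lam[i] * A[i][j] for i in range(m)) <= c[j] for j in range(n))
    if not (primal_ok and dual_ok):
        raise ValueError("x or lam is not feasible")
    return (sum(cj * xj for cj, xj in zip(c, x))
            - sum(li * bi for li, bi in zip(lam, b)))

# Hypothetical problem: minimize -x1 - 2x2 with two equality constraints
# (x3 and x4 play the role of slack variables).
c = [F(-1), F(-2), F(0), F(0)]
A = [[F(1), F(1), F(1), F(0)],
     [F(1), F(3), F(0), F(1)]]
b = [F(4), F(6)]

# A feasible but non-optimal pair: the gap is strictly positive.
gap_loose = duality_gap(c, A, b, [F(0), F(0), F(4), F(6)], [F(-2), F(0)])

# A pair with equal objective values: both are optimal by the corollary.
gap_tight = duality_gap(c, A, b, [F(3), F(1), F(0), F(0)],
                        [F(-1, 2), F(-1, 2)])

print(gap_loose, gap_tight)
```

The first pair leaves a gap of 8. The second pair closes it, which is the situation the Duality Theorem guarantees can always be reached when a finite optimum exists.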


Duality Theorem of Linear Programming. If either of the problems (3) or
(4) has a finite optimal solution, so does the other, and the corresponding
values of the objective functions are equal. If either problem has an unbounded
objective, the other problem has no feasible solution.


Proof. We note first that the second statement is an immediate consequence of
Lemma 1. For if the primal is unbounded and $\lambda$ is feasible for the dual, we must
have $\lambda^T b \le -M$ for arbitrarily large $M$, which is clearly impossible.

Second we note that although the primal and dual are not stated in symmetric 
form it is sufficient, in proving the first statement, to assume that the primal has 
a finite optimal solution and then show that the dual has a solution with the same 
value. This follows because either problem can be converted to standard form and 
because the roles of primal and dual are reversible. 

Suppose (3) has a finite optimal solution with value $z_0$. In the space $E^{m+1}$
define the convex set

$$C = \{ (r, w) : r = t z_0 - c^T x, \; w = t b - A x, \; x \ge 0, \; t \ge 0 \}.$$

It is easily verified that $C$ is in fact a closed convex cone. We show that the point
$(1, 0)$ is not in $C$. If $w = t_0 b - A x_0 = 0$ with $t_0 > 0$, $x_0 \ge 0$, then $x = x_0 / t_0$ is
feasible for (3) and hence $r / t_0 = z_0 - c^T x \le 0$; which means $r \le 0$. If $w = -A x_0 = 0$
with $x_0 \ge 0$ and $c^T x_0 = -1$, and if $x$ is any feasible solution to (3), then $x + \alpha x_0$ is
feasible for any $\alpha \ge 0$ and gives arbitrarily small objective values as $\alpha$ is increased.
This contradicts our assumption on the existence of a finite optimum and thus we
conclude that no such $x_0$ exists. Hence $(1, 0) \notin C$.

Now since $C$ is a closed convex set, there is by Theorem 1, Section B.3, a
hyperplane separating $(1, \mathbf{0})$ and $C$. Thus there is a nonzero vector $[s, \lambda] \in E^{m+1}$
and a constant $c$ such that

    $s < c = \inf \{ s r + \lambda^T w : (r, w) \in C \}$.

Now since $C$ is a cone, it follows that $c \ge 0$. For if there were $(r, w) \in C$ such that
$s r + \lambda^T w < 0$, then $\alpha (r, w)$ for large $\alpha$ would violate the hyperplane inequality. On
the other hand, since $(0, \mathbf{0}) \in C$ we must have $c \le 0$. Thus $c = 0$. As a consequence
$s < 0$, and without loss of generality we may assume $s = -1$.

We have to this point established the existence of $\lambda \in E^m$ such that

    $-r + \lambda^T w \ge 0$

for all $(r, w) \in C$. Equivalently, using the definition of $C$,

    $(c^T - \lambda^T A) x - t z_0 + t \lambda^T b \ge 0$

for all $x \ge 0$, $t \ge 0$. Setting $t = 0$ yields $\lambda^T A \le c^T$, which says $\lambda$ is feasible for the
dual. Setting $x = 0$ and $t = 1$ yields $\lambda^T b \ge z_0$, which in view of Lemma 1 and its
corollary shows that $\lambda$ is optimal for the dual. ∎


4.3 RELATIONS TO THE SIMPLEX PROCEDURE 


In this section the Duality Theorem is proved by making explicit use of the charac- 
teristics of the simplex procedure. As a result of this proof it becomes clear that 
once the primal is solved by the simplex procedure a solution to the dual is readily 
obtainable. 

Suppose that for the linear program

    minimize    $c^T x$
    subject to  $A x = b$        (5)
                $x \ge 0$,

we have the optimal basic feasible solution $x = (x_B, 0)$ with corresponding basis $B$.
We shall determine a solution of the dual program

    maximize    $\lambda^T b$        (6)
    subject to  $\lambda^T A \le c^T$

in terms of $B$.

We partition $A$ as $A = [B, D]$. Since the basic feasible solution $x_B = B^{-1} b$ is
optimal, the relative cost vector $r$ must be nonnegative in each component. From
Section 3.7 we have

    $r_D^T = c_D^T - c_B^T B^{-1} D$,

and since $r_D$ is nonnegative in each component we have $c_B^T B^{-1} D \le c_D^T$.


Now define $\lambda^T = c_B^T B^{-1}$. We show that this choice of $\lambda$ solves the dual problem.
We have

    $\lambda^T A = [\lambda^T B, \lambda^T D] = [c_B^T, c_B^T B^{-1} D] \le [c_B^T, c_D^T] = c^T$.

Thus since $\lambda^T A \le c^T$, $\lambda$ is feasible for the dual. On the other hand,

    $\lambda^T b = c_B^T B^{-1} b = c_B^T x_B$,

and thus the value of the dual objective function for this $\lambda$ is equal to the value
of the primal problem. This, in view of Lemma 1, Section 4.2, establishes the
optimality of $\lambda$ for the dual. The above discussion yields an alternative derivation
of the main portion of the Duality Theorem.


Theorem. Let the linear program (5) have an optimal basic feasible solution
corresponding to the basis $B$. Then the vector $\lambda$ satisfying $\lambda^T = c_B^T B^{-1}$ is an
optimal solution to the dual program (6). The optimal values of both problems
are equal.


We turn now to a discussion of how the solution of the dual can be obtained
directly from the final simplex tableau of the primal. Suppose that embedded in the
original matrix $A$ is an $m \times m$ identity matrix. This will be the case if, for example,
$m$ slack variables are employed to convert inequalities to equalities. Then in the
final tableau the matrix $B^{-1}$ appears where the identity appeared in the beginning.
Furthermore, in the last row the components corresponding to this identity matrix
will be $c_I^T - c_B^T B^{-1}$, where $c_I$ is the $m$-vector representing the cost coefficients of
the variables corresponding to the columns of the original identity matrix. Thus by
subtracting these cost coefficients from the corresponding elements in the last row,
the negative of the solution $\lambda^T = c_B^T B^{-1}$ to the dual is obtained. In particular, if, as
is the case with slack variables, $c_I = 0$, then the elements in the last row under $B^{-1}$
are equal to the negative of components of the solution to the dual.


Example. Consider the primal program 


    minimize    $-x_1 - 4x_2 - 3x_3$
    subject to  $2x_1 + 2x_2 + x_3 \le 4$
                $x_1 + 2x_2 + 2x_3 \le 6$
                $x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0$.

This can be solved by introducing slack variables and using the simplex procedure.
The appropriate sequence of tableaus is given below without explanation.


         2      2      1      1      0      4
         1      2      2      0      1      6
        -1     -4     -3      0      0      0


         1      1     1/2    1/2     0      2
        -1      0      1     -1      1      2
         3      0     -1      2      0      8


        3/2     1      0      1    -1/2     1
        -1      0      1     -1      1      2
         2      0      0      1      1     10


The optimal solution is $x_1 = 0$, $x_2 = 1$, $x_3 = 2$. The corresponding dual program is

    maximize    $4\lambda_1 + 6\lambda_2$
    subject to  $2\lambda_1 + \lambda_2 \le -1$
                $2\lambda_1 + 2\lambda_2 \le -4$
                $\lambda_1 + 2\lambda_2 \le -3$
                $\lambda_1 \le 0,\ \lambda_2 \le 0$.

The optimal solution to the dual is obtained directly from the last row of the simplex
tableau under the columns where the identity appeared in the first tableau: $\lambda_1 = -1$,
$\lambda_2 = -1$.
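
The relation $\lambda^T = c_B^T B^{-1}$ can be checked numerically on this example. The following is a minimal sketch, not part of the original development: it assumes the optimal basis consists of the columns of $x_2$ and $x_3$ (as read from the final tableau), and the helper `solve` and all variable names are illustrative choices. Exact rational arithmetic is used so that the comparison with the tableau is not blurred by rounding.

```python
from fractions import Fraction


def solve(M, rhs):
    """Solve the square system M z = rhs exactly by Gauss-Jordan elimination."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        d = aug[col][col]
        aug[col] = [v / d for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * p for a, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


# Example of this section in standard form (x4, x5 are the slack variables).
A = [[2, 2, 1, 1, 0],
     [1, 2, 2, 0, 1]]
b = [4, 6]
c = [-1, -4, -3, 0, 0]
basis = [1, 2]  # assumed optimal basis: 0-based column indices of x2, x3

m = len(A)
B = [[A[i][j] for j in basis] for i in range(m)]
B_T = [[B[i][j] for i in range(m)] for j in range(m)]
c_B = [c[j] for j in basis]

x_B = solve(B, b)        # x_B = B^{-1} b
lam = solve(B_T, c_B)    # lambda^T = c_B^T B^{-1}, i.e. B^T lambda = c_B

primal_value = sum(cj * xj for cj, xj in zip(c_B, x_B))
dual_value = sum(li * bi for li, bi in zip(lam, b))
# Relative costs c_j - lambda^T a_j; all nonnegative means lambda is dual feasible.
relative_costs = [c[j] - sum(lam[i] * A[i][j] for i in range(m))
                  for j in range(len(c))]

print(x_B, lam, primal_value, dual_value, relative_costs)
```

Under these assumptions the computed relative costs coincide with the last row of the final tableau, and the entries under the slack columns are the negatives of the components of $\lambda$.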


Geometric Interpretation 


The duality relations can be viewed in terms of the dual interpretations of linear 
constraints emphasized in Chapter 3. Consider a linear program in standard form. 
For sake of concreteness we consider the problem 


    minimize    $18x_1 + 12x_2 + 2x_3 + 6x_4$
    subject to  $3x_1 + x_2 - 2x_3 + x_4 = 2$
                $x_1 + 3x_2 - x_4 = 2$
                $x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0,\ x_4 \ge 0$.

The columns of the constraints are represented in requirements space in Fig. 4.2.
A basic solution represents construction of $b$ with positive weights on two of the
$a_i$'s. The dual problem is

    maximize    $2\lambda_1 + 2\lambda_2$
    subject to  $3\lambda_1 + \lambda_2 \le 18$
                $\lambda_1 + 3\lambda_2 \le 12$
                $-2\lambda_1 \le 2$
                $\lambda_1 - \lambda_2 \le 6$.


Fig. 4.2 The primal requirements space 


The dual problem is shown geometrically in Fig. 4.3. Each column $a_i$ of the
primal defines a constraint of the dual as a half-space whose boundary is orthogonal
to that column vector and is located at a point determined by $c_i$. The dual objective
is maximized at an extreme point of the dual feasible region. At this point exactly
two dual constraints are active. These active constraints correspond to an optimal
basis of the primal. In fact, the vector defining the dual objective is a positive linear
combination of the column vectors defining the active constraints. In the specific
example, $b$ is a positive combination of $a_1$ and $a_2$. The weights in this combination
are the $x_i$'s in the solution of the primal.


Fig. 4.3 The dual in activity space


Simplex Multipliers


We conclude this section by giving an economic interpretation of the relation
between the simplex basis and the vector $\lambda$. At any point in the simplex procedure
we may form the vector $\lambda$ satisfying $\lambda^T = c_B^T B^{-1}$. This vector is not a solution
to the dual unless $B$ is an optimal basis for the primal, but nevertheless, it has an
economic interpretation. Furthermore, as we have seen in the development of the
revised simplex method, this $\lambda$ vector can be used at every step to calculate the
relative cost coefficients. For this reason $\lambda^T = c_B^T B^{-1}$, corresponding to any basis,
is often called the vector of simplex multipliers.

Let us pursue the economic interpretation of these simplex multipliers. As
usual, denote the columns of $A$ by $a_1, a_2, \ldots, a_n$ and denote by $e_1, e_2, \ldots, e_m$ the
$m$ unit vectors in $E^m$. The components of the $a_i$'s and $b$ tell how to construct these
vectors from the $e_i$'s.

Given any basis $B$, however, consisting of $m$ columns of $A$, any other vector
can be constructed (synthetically) as a linear combination of these basis vectors.
If there is a unit cost $c_i$ associated with each basis vector $a_i$, then the cost of a
(synthetic) vector constructed from the basis can be calculated as the corresponding
linear combination of the $c_i$'s associated with the basis. In particular, the cost of
the $j$th unit vector, $e_j$, when constructed from the basis $B$, is $\lambda_j$, the $j$th component
of $\lambda^T = c_B^T B^{-1}$. Thus the $\lambda_j$'s can be interpreted as synthetic prices of the unit
vectors.

Now, any vector can be expressed in terms of the basis B in two steps: 
(i) express the unit vectors in terms of the basis, and then (ii) express the desired 
vector as a linear combination of unit vectors. The corresponding synthetic cost of 
a vector constructed from the basis B can correspondingly be computed directly by: 
(i) finding the synthetic price of the unit vectors, and then (ii) using these prices 
to evaluate the cost of the linear combination of unit vectors. Thus, the simplex 
multipliers can be used to quickly evaluate the synthetic cost of any vector that 
is expressed in terms of the unit vectors. The difference between the true cost of 
this vector and the synthetic cost is the relative cost. The process of calculating 
the synthetic cost of a vector, with respect to a given basis, by using the simplex 
multipliers is sometimes referred to as pricing out the vector. 

Optimality of the primal corresponds to the situation where every vector
$a_1, a_2, \ldots, a_n$ is cheaper when constructed from the basis than when purchased
directly at its own price. Thus we have $\lambda^T a_i \le c_i$ for $i = 1, 2, \ldots, n$ or equivalently
$\lambda^T A \le c^T$.
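
Pricing out can be illustrated at a basis that is not yet optimal. The sketch below is illustrative only: it reuses the data of the example of Section 4.3 in standard form and assumes the intermediate basis formed by the columns of $x_2$ and $x_5$ (the basis of the second tableau there); the helper `inv2` and the function `price_out` are hypothetical names introduced for this illustration.

```python
from fractions import Fraction


def inv2(M):
    """Exact inverse of a 2x2 matrix."""
    (p, q), (r, s) = M
    det = Fraction(p * s - q * r)
    return [[s / det, -q / det], [-r / det, p / det]]


# Example of Section 4.3 in standard form (x4, x5 are slack variables).
A = [[2, 2, 1, 1, 0],
     [1, 2, 2, 0, 1]]
c = [-1, -4, -3, 0, 0]
basis = [1, 4]  # assumed intermediate basis: 0-based columns of x2 and x5

B_inv = inv2([[A[i][j] for j in basis] for i in range(2)])
c_B = [c[j] for j in basis]

# Simplex multipliers: lambda^T = c_B^T B^{-1} (synthetic prices of e_1, e_2).
lam = [sum(c_B[i] * B_inv[i][j] for i in range(2)) for j in range(2)]


def price_out(column):
    """Synthetic cost of a vector expressed in terms of the unit vectors."""
    return sum(l * v for l, v in zip(lam, column))


columns = [[A[i][j] for i in range(2)] for j in range(5)]
# Relative cost = true cost minus synthetic cost.
relative_costs = [c[j] - price_out(columns[j]) for j in range(5)]

print(lam, relative_costs)
```

With this basis the relative cost of the column of $x_3$ comes out negative, so the multipliers are not feasible for the dual, in agreement with the remark that $\lambda$ solves the dual only when $B$ is optimal.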


4.4 SENSITIVITY AND COMPLEMENTARY 
SLACKNESS 


The optimal values of the dual variables in a linear program can, as we have seen, 
be interpreted as prices. In this section this interpretation is explored in further 
detail. 
Sensitivity


Suppose in the linear program

    minimize    $c^T x$
    subject to  $A x = b$        (7)
                $x \ge 0$,

the optimal basis is $B$ with corresponding solution $(x_B, 0)$, where $x_B = B^{-1} b$. A
solution to the corresponding dual is $\lambda^T = c_B^T B^{-1}$.

Now, assuming nondegeneracy, small changes in the vector $b$ will not cause
the optimal basis to change. Thus for $b + \Delta b$ the optimal solution is

    $x = (x_B + \Delta x_B, 0)$,

where $\Delta x_B = B^{-1} \Delta b$. Thus the corresponding increment in the cost function is

    $\Delta z = c_B^T \Delta x_B = \lambda^T \Delta b$.        (8)

This equation shows that $\lambda$ gives the sensitivity of the optimal cost with respect to
small changes in the vector $b$. In other words, if a new program were solved with $b$
changed to $b + \Delta b$, the change in the optimal value of the objective function would
be $\lambda^T \Delta b$.
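
Equation (8) can be checked on the example of Section 4.3. The sketch below is illustrative: it assumes the optimal basis there is formed by the columns of $x_2$ and $x_3$ with $\lambda = (-1, -1)$, and the perturbation `db` is an arbitrary small change chosen for the illustration, not a value taken from the text.

```python
from fractions import Fraction as F

# Optimal basis of the Section 4.3 example: columns of x2 and x3.
B = [[2, 1],
     [2, 2]]
c_B = [-4, -3]
b = [F(4), F(6)]
lam = [F(-1), F(-1)]          # dual solution lambda^T = c_B^T B^{-1}
db = [F(1, 10), F(-1, 5)]     # hypothetical small change in b


def basic_solution(rhs):
    """x_B = B^{-1} rhs for the fixed 2x2 basis B."""
    (p, q), (r, s) = B
    det = F(p * s - q * r)
    return [(s * rhs[0] - q * rhs[1]) / det,
            (-r * rhs[0] + p * rhs[1]) / det]


def cost(rhs):
    """Objective value c_B^T x_B of the basic solution for this right-hand side."""
    return sum(ci * xi for ci, xi in zip(c_B, basic_solution(rhs)))


b_new = [bi + di for bi, di in zip(b, db)]
x_new = basic_solution(b_new)

# The basis stays optimal as long as the new basic solution is nonnegative,
# because the relative costs do not depend on b.
still_feasible = all(v >= 0 for v in x_new)

dz = cost(b_new) - cost(b)                          # actual change in cost
predicted = sum(l * d for l, d in zip(lam, db))     # lambda^T * delta b

print(still_feasible, dz, predicted)
```

For a perturbation large enough to make some component of the new basic solution negative, the basis changes and (8) no longer applies as stated.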

This interpretation of the dual vector $\lambda$ is intimately related to its interpretation
as a vector of simplex multipliers. Since $\lambda_j$ is the price of the unit vector $e_j$ when
constructed from the basis $B$, it directly measures the change in cost due to a change
in the $j$th component of the vector $b$. Thus, $\lambda_j$ may equivalently be considered as
the marginal price of the component $b_j$, since if $b_j$ is changed to $b_j + \Delta b_j$ the value
of the optimal solution changes by $\lambda_j \Delta b_j$.

If the linear program is interpreted as a diet problem, for instance, then $\lambda_j$ is
the maximum price per unit that the dietician would be willing to pay for a small
amount of the $j$th nutrient, because decreasing the amount of nutrient that must
be supplied by food will reduce the food bill by $\lambda_j$ dollars per unit. If, as another
example, the linear program is interpreted as the problem faced by a manufacturer
who must select levels $x_1, x_2, \ldots, x_n$ of $n$ production activities in order to meet
certain required levels of output $b_1, b_2, \ldots, b_m$ while minimizing production costs,
the $\lambda_i$'s are the marginal prices of the outputs. They show directly how much the
production cost varies if a small change is made in the output levels.


Complementary Slackness 


The optimal solutions to primal and dual programs satisfy an additional relation 
that has an economic interpretation. This relation can be stated for any pair of dual 
linear programs, but we state it here only for the asymmetric and the symmetric 
pairs defined in Section 4.1. 
Theorem 1 (Complementary slackness, asymmetric form). Let $x$ and $\lambda$ be
feasible solutions for the primal and dual programs, respectively, in the pair (2).
A necessary and sufficient condition that they both be optimal solutions is that*
for all $i$

    i)   $x_i > 0 \Rightarrow \lambda^T a_i = c_i$
    ii)  $x_i = 0 \Leftarrow \lambda^T a_i < c_i$.


Proof. If the stated conditions hold, then clearly $(\lambda^T A - c^T) x = 0$. Thus $\lambda^T b =
c^T x$, and by the corollary to Lemma 1, Section 4.2, the two solutions are optimal.
Conversely, if the two solutions are optimal, it must hold, by the Duality Theorem,
that $\lambda^T b = c^T x$ and hence that $(\lambda^T A - c^T) x = 0$. Since each component of $x$ is
nonnegative and each component of $\lambda^T A - c^T$ is nonpositive, the conditions (i) and
(ii) must hold. ∎


Theorem 2 (Complementary slackness, symmetric form). Let $x$ and $\lambda$ be
feasible solutions for the primal and dual programs, respectively, in the pair (1).
A necessary and sufficient condition that they both be optimal solutions is that
for all $i$ and $j$

    i)   $x_i > 0 \Rightarrow \lambda^T a_i = c_i$
    ii)  $x_i = 0 \Leftarrow \lambda^T a_i < c_i$
    iii) $\lambda_j > 0 \Rightarrow a^j x = b_j$
    iv)  $\lambda_j = 0 \Leftarrow a^j x > b_j$

(where $a^j$ is the $j$th row of $A$).

Proof. This follows by transforming the previous theorem. ∎


The complementary slackness conditions have a rather obvious economic
interpretation. Thinking in terms of the diet problem, for example, which is the primal
part of a symmetric pair of dual problems, suppose that the optimal diet supplies
more than $b_j$ units of the $j$th nutrient. This means that the dietician would be
unwilling to pay anything for small quantities of that nutrient, since availability
of it would not reduce the cost of the optimal diet. This, in view of our previous
interpretation of $\lambda_j$ as a marginal price, implies $\lambda_j = 0$ which is (iv) of Theorem 2.
The other conditions have similar interpretations which the reader can work out.
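
The conditions of Theorem 2 are easy to test for a given feasible pair. The sketch below is illustrative only. It assumes the symmetric pair in the form minimize $c^T x$ subject to $Ax \ge b$, $x \ge 0$ and maximize $\lambda^T b$ subject to $\lambda^T A \le c^T$, $\lambda \ge 0$, and it borrows the data of the small example treated in Section 4.5, taking $x = (1, 2, 0)$ and $\lambda = (1, 1)$ as the candidate optimal pair (these values are assumptions of the sketch; the function names are hypothetical).

```python
def is_feasible_pair(A, b, c, x, lam):
    """Check primal feasibility (Ax >= b, x >= 0) and dual feasibility
    (lam^T A <= c^T, lam >= 0) for the symmetric pair."""
    m, n = len(A), len(c)
    primal_ok = all(v >= 0 for v in x) and all(
        sum(A[j][i] * x[i] for i in range(n)) >= b[j] for j in range(m))
    dual_ok = all(v >= 0 for v in lam) and all(
        sum(lam[j] * A[j][i] for j in range(m)) <= c[i] for i in range(n))
    return primal_ok and dual_ok


def complementary_slackness_holds(A, b, c, x, lam):
    """Conditions (i)-(iv) of Theorem 2, written as products equal to zero."""
    m, n = len(A), len(c)
    dual_slack = [c[i] - sum(lam[j] * A[j][i] for j in range(m))
                  for i in range(n)]
    primal_surplus = [sum(A[j][i] * x[i] for i in range(n)) - b[j]
                      for j in range(m)]
    return (all(x[i] * dual_slack[i] == 0 for i in range(n))
            and all(lam[j] * primal_surplus[j] == 0 for j in range(m)))


A = [[1, 2, 3],
     [2, 2, 1]]
b = [5, 6]
c = [3, 4, 5]

x_opt, lam_opt = [1, 2, 0], [1, 1]   # candidate optimal pair (assumed)
x_other = [6, 0, 0]                  # feasible but more expensive diet

print(complementary_slackness_holds(A, b, c, x_opt, lam_opt))
print(complementary_slackness_holds(A, b, c, x_other, lam_opt))
```

Writing each implication as a vanishing product is equivalent here because every factor is nonnegative for a feasible pair.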


*4.5 THE DUAL SIMPLEX METHOD 


Often there is available a basic solution to a linear program which is not feasible
but which prices out optimally; that is, the simplex multipliers are feasible for
the dual problem. In the simplex tableau this situation corresponds to having no
negative elements in the bottom row but an infeasible basic solution. Such a situation
may arise, for example, if a solution to a certain linear programming problem is
calculated and then a new problem is constructed by changing the vector $b$. In such
situations a basic feasible solution to the dual is available and hence it is desirable
to pivot in such a way as to optimize the dual.

*The symbol $\Rightarrow$ means "implies" and $\Leftarrow$ means "is implied by."

Rather than constructing a tableau for the dual problem (which, if the primal is
in standard form, involves $m$ free variables and $n$ nonnegative slack variables), it is
more efficient to work on the dual from the primal tableau. The complete technique
based on this idea is the dual simplex method. In terms of the primal problem,
it operates by maintaining the optimality condition of the last row while working
toward feasibility. In terms of the dual problem, however, it maintains feasibility
while working toward optimality.

Given the linear program

    minimize    $c^T x$
    subject to  $A x = b$        (9)
                $x \ge 0$,

suppose a basis $B$ is known such that $\lambda$ defined by $\lambda^T = c_B^T B^{-1}$ is feasible for
the dual. In this case we say that the corresponding basic solution to the primal,
$x_B = B^{-1} b$, is dual feasible. If $x_B \ge 0$ then this solution is also primal feasible and
hence optimal.

The given vector $\lambda$ is feasible for the dual and thus satisfies $\lambda^T a_j \le c_j$ for
$j = 1, 2, \ldots, n$. Indeed, assuming as usual that the basis is the first $m$ columns of
$A$, there is equality

    $\lambda^T a_j = c_j$,  for $j = 1, 2, \ldots, m$,        (10a)

and (barring degeneracy in the dual) there is inequality

    $\lambda^T a_j < c_j$,  for $j = m+1, \ldots, n$.        (10b)


To develop one cycle of the dual simplex method, we find a new vector A such that 
one of the equalities becomes an inequality and one of the inequalities becomes 
equality, while at the same time increasing the value of the dual objective function. 
The m equalities in the new solution then determine a new basis. 

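Denote the $i$th row of $B^{-1}$ by $u^i$. Then for

    $\bar{\lambda}^T = \lambda^T - \varepsilon u^i$,        (11)

we have $\bar{\lambda}^T a_j = \lambda^T a_j - \varepsilon u^i a_j$. Thus, recalling that $z_j = \lambda^T a_j$ and noting that $u^i a_j =
y_{ij}$, the $ij$th element of the tableau, we have

    $\bar{\lambda}^T a_j = c_j$,  $j = 1, 2, \ldots, m$,  $j \ne i$        (12a)
    $\bar{\lambda}^T a_i = c_i - \varepsilon$        (12b)
    $\bar{\lambda}^T a_j = z_j - \varepsilon y_{ij}$,  $j = m+1, m+2, \ldots, n$.        (12c)

Also,

    $\bar{\lambda}^T b = \lambda^T b - \varepsilon x_{B_i}$.        (13)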


These last equations lead directly to the algorithm: 


Step 1. Given a dual feasible basic solution $x_B$, if $x_B \ge 0$ the solution is optimal. If
$x_B$ is not nonnegative, select an index $i$ such that the $i$th component of $x_B$, $x_{B_i} < 0$.


Step 2. If all $y_{ij} \ge 0$, $j = 1, 2, \ldots, n$, then the dual has no maximum (this follows
since by (12) $\bar{\lambda}$ is feasible for all $\varepsilon > 0$). If $y_{ij} < 0$ for some $j$, then let

    $\varepsilon_0 = \dfrac{z_k - c_k}{y_{ik}} = \min_j \left\{ \dfrac{z_j - c_j}{y_{ij}} : y_{ij} < 0 \right\}$.        (14)


Step 3. Form a new basis $B$ by replacing $a_i$ by $a_k$. Using this basis determine the
corresponding basic dual feasible solution $x_B$ and return to Step 1.


The proof that the algorithm converges to the optimal solution is similar in its
details to the proof for the primal simplex procedure. The essential observations
are: (a) from the choice of $k$ in (14) and from (12a, b, c) the new solution will
again be dual feasible; (b) by (13) and the choice $x_{B_i} < 0$, the value of the dual
objective will increase; (c) the procedure cannot terminate at a nonoptimum point;
and (d) since there are only a finite number of bases, the optimum must be achieved
in a finite number of steps.


Example. A form of problem arising frequently is that of minimizing a positive 
combination of positive variables subject to a series of “greater than” type inequal- 
ities having positive coefficients. Such problems are natural candidates for appli- 
cation of the dual simplex procedure. The classical diet problem is of this type as 
is the simple example below. 


    minimize    $3x_1 + 4x_2 + 5x_3$
    subject to  $x_1 + 2x_2 + 3x_3 \ge 5$
                $2x_1 + 2x_2 + x_3 \ge 6$
                $x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0$.


By introducing surplus variables and by changing the sign of the inequalities we 
obtain the initial tableau 


        -1     -2     -3      1      0     -5
       (-2)    -2     -1      0      1     -6
         3      4      5      0      0      0

    Initial tableau

The basis corresponds to a dual feasible solution since all of the $(c_j - z_j)$'s are
nonnegative. We select any $x_{B_i} < 0$, say $x_5 = -6$, to remove from the set of basic
variables. To find the appropriate pivot element in the second row we compute
the ratios $(z_j - c_j)/y_{2j}$ and select the minimum positive ratio. This yields the pivot
indicated (shown in parentheses). Continuing, the remaining tableaus are

         0    (-1)   -5/2     1    -1/2    -2
         1      1     1/2     0    -1/2     3
         0      1     7/2     0     3/2    -9

    Second tableau

         0      1     5/2    -1     1/2     2
         1      0     -2      1     -1      1
         0      0      1      1      1    -11

    Final tableau


The third tableau yields a feasible solution to the primal which must be optimal.
Thus the solution is $x_1 = 1$, $x_2 = 2$, $x_3 = 0$.
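
The three steps of the dual simplex method can be sketched in tableau form as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the last row of the tableau is assumed to hold the relative costs $c_j - z_j$ (all nonnegative), the last column holds the right-hand side, the leaving row is chosen as the one with the most negative right-hand side (Step 1 allows any negative component), and no safeguard against cycling is included. The function name `dual_simplex` and its argument layout are choices made for this sketch.

```python
from fractions import Fraction


def dual_simplex(tableau, basis):
    """Dual simplex on a tableau with relative costs in the last row and the
    right-hand side in the last column. basis[i] is the column index of the
    basic variable of row i. Returns (final tableau, final basis)."""
    T = [[Fraction(v) for v in row] for row in tableau]
    basis = list(basis)
    m = len(T) - 1        # number of constraint rows
    n = len(T[0]) - 1     # number of variables
    while True:
        # Step 1: look for a negative basic variable (most negative here).
        rows = [i for i in range(m) if T[i][n] < 0]
        if not rows:
            return T, basis               # primal feasible, hence optimal
        i = min(rows, key=lambda r: T[r][n])
        # Step 2: ratio test (z_j - c_j) / y_ij over entries y_ij < 0.
        cols = [j for j in range(n) if T[i][j] < 0]
        if not cols:
            raise ValueError("dual has no maximum; primal is infeasible")
        k = min(cols, key=lambda j: T[m][j] / -T[i][j])
        # Step 3: pivot on element (i, k); a_k replaces the old basic column.
        piv = T[i][k]
        T[i] = [v / piv for v in T[i]]
        for r in range(m + 1):
            if r != i and T[r][k] != 0:
                f = T[r][k]
                T[r] = [a - f * p for a, p in zip(T[r], T[i])]
        basis[i] = k


# Initial tableau of the example above (columns x1..x5, then b).
initial = [[-1, -2, -3, 1, 0, -5],
           [-2, -2, -1, 0, 1, -6],
           [3, 4, 5, 0, 0, 0]]
final, final_basis = dual_simplex(initial, [3, 4])

x = [Fraction(0)] * 5
for row, col in enumerate(final_basis):
    x[col] = final[row][-1]
optimal_cost = -final[-1][-1]   # the corner entry holds the negative of the cost

print(x, optimal_cost)
```

Because the relative costs in row $m$ are $c_j - z_j$, the ratio $(z_j - c_j)/y_{ij}$ with $y_{ij} < 0$ is computed as the relative cost divided by $-y_{ij}$.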


*4.6 THE PRIMAL—DUAL ALGORITHM 


In this section a procedure is described for solving linear programming problems by 
working simultaneously on the primal and the dual problems. The procedure begins 
with a feasible solution to the dual that is improved at each step by optimizing an 
associated restricted primal problem. As the method progresses it can be regarded 
as striving to achieve the complementary slackness conditions for optimality. Origi- 
nally, the primal—dual method was developed for solving a special kind of linear 
program arising in network flow problems, and it continues to be the most efficient 
procedure for these problems. (For general linear programs the dual simplex method 
is most frequently used). In this section we describe the generalized version of the 
algorithm and point out an interesting economic interpretation of it. We consider 
the program 


    minimize    $c^T x$
    subject to  $A x = b$        (15)
                $x \ge 0$

and the corresponding dual program

    maximize    $\lambda^T b$        (16)
    subject to  $\lambda^T A \le c^T$.

Given a feasible solution $\lambda$ to the dual, define the subset $P$ of $1, 2, \ldots, n$ by
$i \in P$ if $\lambda^T a_i = c_i$ where $a_i$ is the $i$th column of $A$. Thus, since $\lambda$ is dual feasible,
it follows that $i \notin P$ implies $\lambda^T a_i < c_i$. Now corresponding to $\lambda$ and $P$, we define
the associated restricted primal problem

    minimize    $\mathbf{1}^T y$
    subject to  $A x + y = b$        (17)
                $x \ge 0$,  $x_i = 0$ for $i \notin P$
                $y \ge 0$,

where $\mathbf{1}$ denotes the $m$-vector $(1, 1, \ldots, 1)$.
The dual of this associated restricted primal is called the associated restricted
dual. It is

    maximize    $u^T b$
    subject to  $u^T a_i \le 0$,  $i \in P$        (18)
                $u \le \mathbf{1}$.


The condition for optimality of the primal—dual method is expressed in the following 
theorem. 


Primal-Dual Optimality Theorem. Suppose that $\lambda$ is feasible for the dual
and that $x$ and $y = 0$ is feasible (and of course optimal) for the associated
restricted primal. Then $x$ and $\lambda$ are optimal for the original primal and dual
programs, respectively.


Proof. Clearly $x$ is feasible for the primal. Also we have $c^T x = \lambda^T A x$, because
$\lambda^T A$ is identical to $c^T$ on the components corresponding to nonzero elements of $x$.
Thus $c^T x = \lambda^T A x = \lambda^T b$ and optimality follows from Lemma 1, Section 4.2. ∎


The primal—dual method starts with a feasible solution to the dual and then 
optimizes the associated restricted primal. If the optimal solution to this associated 
restricted primal is not feasible for the primal, the feasible solution to the dual is 
improved and a new associated restricted primal is determined. Here are the details: 


Step 1. Given a feasible solution $\lambda_0$ to the dual program (16), determine the
associated restricted primal according to (17).


Step 2. Optimize the associated restricted primal. If the minimal value of this
problem is zero, the corresponding solution is optimal for the original primal by
the Primal-Dual Optimality Theorem.


Step 3. If the minimal value of the associated restricted primal is strictly positive,
obtain from the final simplex tableau of the restricted primal, the solution $u_0$ of
the associated restricted dual (18). If there is no $j$ for which $u_0^T a_j > 0$ conclude the
primal has no feasible solutions. If, on the other hand, for at least one $j$, $u_0^T a_j > 0$,
define the new dual feasible vector

    $\lambda = \lambda_0 + \varepsilon_0 u_0$

where

    $\varepsilon_0 = \dfrac{c_k - \lambda_0^T a_k}{u_0^T a_k} = \min_j \left\{ \dfrac{c_j - \lambda_0^T a_j}{u_0^T a_j} : u_0^T a_j > 0 \right\}$.

Now go back to Step 1 using this $\lambda$.
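
The dual update of Step 3 can be sketched in isolation. The code below is illustrative only: it takes $\lambda_0$ and the restricted dual solution $u_0$ as given (solving the restricted primal itself is not shown), and the function name `primal_dual_step` is a hypothetical choice. The data are those of the example later in this section; the second vector $u_0 = (-1, 1)$ is an inference from the third row of the tableau there, not a value stated in the text.

```python
from fractions import Fraction


def primal_dual_step(A, c, lam0, u0):
    """One dual update lam = lam0 + eps0 * u0 of the primal-dual method.
    Returns (eps0, k, new_lam), or None when no column has u0^T a_j > 0,
    in which case the primal has no feasible solution."""
    m, n = len(A), len(c)
    best = None
    for j in range(n):
        ua = sum(u0[i] * A[i][j] for i in range(m))          # u0^T a_j
        if ua > 0:
            slack = c[j] - sum(lam0[i] * A[i][j] for i in range(m))
            ratio = Fraction(slack) / ua
            if best is None or ratio < best[0]:
                best = (ratio, j)
    if best is None:
        return None
    eps0, k = best
    return eps0, k, [l + eps0 * u for l, u in zip(lam0, u0)]


A = [[1, 1, 2],
     [2, 1, 3]]
c = [2, 1, 4]

# First update: lambda_0 = (0, 0), u_0 = (1, 1).
step1 = primal_dual_step(A, c, [0, 0], [1, 1])
# Second update: u_0 = (-1, 1) is assumed (inferred from the example tableau).
step2 = primal_dual_step(A, c, step1[2], [-1, 1])

print(step1, step2)
```

The index `k` returned is 0-based; it identifies the column whose dual constraint becomes active and which therefore joins the set $P$.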


To prove convergence of this method a few simple observations and explanations
must be made. First we verify the statement made in Step 3 that $u_0^T a_j \le 0$
for all $j$ implies that the primal has no feasible solution. The vector $\lambda_\varepsilon = \lambda_0 + \varepsilon u_0$
is feasible for the dual problem for all positive $\varepsilon$, since $u_0^T A \le 0$. In addition,
$\lambda_\varepsilon^T b = \lambda_0^T b + \varepsilon u_0^T b$ and, since $u_0^T b = \mathbf{1}^T y > 0$, we see that as $\varepsilon$ is increased we
obtain an unbounded solution to the dual. In view of the Duality Theorem, this
implies that there is no feasible solution to the primal.

Next suppose that in Step 3, for at least one $j$, $u_0^T a_j > 0$. Again we define
the family of vectors $\lambda_\varepsilon = \lambda_0 + \varepsilon u_0$. Since $u_0$ is a solution to (18) we have
$u_0^T a_i \le 0$ for $i \in P$, and hence for small positive $\varepsilon$ the vector $\lambda_\varepsilon$ is feasible for
the dual. We increase $\varepsilon$ to the first point where one of inequalities $\lambda_\varepsilon^T a_j < c_j$,
$j \notin P$ becomes an equality. This determines $\varepsilon_0 > 0$ and $k$. The new $\lambda$ vector
corresponds to an increased value of the dual objective $\lambda^T b = \lambda_0^T b + \varepsilon_0 u_0^T b$. In
addition, the corresponding new set $P$ now includes the index $k$. Any other index $i$
that corresponded to a positive value of $x_i$ in the associated restricted primal is in
the new set $P$, because by complementary slackness $u_0^T a_i = 0$ for such an $i$ and thus
$\lambda^T a_i = \lambda_0^T a_i + \varepsilon_0 u_0^T a_i = c_i$. This means that the old optimal solution is feasible for
the new associated restricted primal and that $a_k$ can be pivoted into the basis. Since
$u_0^T a_k > 0$, pivoting in $a_k$ will decrease the value of the associated restricted primal.

In summary, it has been shown that at each step either an improvement in 
the associated primal is made or an infeasibility condition is detected. Assuming 
nondegeneracy, this implies that no basis of the associated primal is repeated—and 
since there are only a finite number of possible bases, the solution is reached in a 
finite number of steps. 

The primal-dual algorithm can be given an interesting interpretation in terms
of the manufacturing problem in Example 3, Section 2.2. Suppose we own a facility
that is capable of engaging in $n$ different production activities each of which
produces various amounts of $m$ commodities. Each activity $i$ can be operated at any
level $x_i \ge 0$, but when operated at the unity level the $i$th activity costs $c_i$ dollars and
yields the $m$ commodities in the amounts specified by the $m$-vector $a_i$. Assuming
linearity of the production facility, if we are given a vector $b$ describing output
requirements of the $m$ commodities, and we wish to produce these at minimum
cost, ours is the primal problem.

Imagine that an entrepreneur not knowing the value of our requirements vector
$b$ decides to sell us these requirements directly. He assigns a price vector $\lambda_0$ to
these requirements such that $\lambda_0^T A \le c^T$. In this way his prices are competitive with
our production activities, and he can assure us that purchasing directly from him is
no more costly than engaging activities. As owner of the production facilities we are
reluctant to abandon our production enterprise but, on the other hand, we deem it not
frugal to engage an activity whose output can be duplicated by direct purchase for
lower cost. Therefore, we decide to engage only activities that cannot be duplicated
cheaper, and at the same time we attempt to minimize the total business volume
given the entrepreneur. Ours is the associated restricted primal problem.

Upon receiving our order, the greedy entrepreneur decides to modify his prices
in such a manner as to keep them competitive with our activities but increase the
cost of our order. As a reasonable and simple approach he seeks new prices of
the form

    $\lambda = \lambda_0 + \varepsilon u_0$,

where he selects $u_0$ as the solution to

    maximize    $u^T y$
    subject to  $u^T a_i \le 0$,  $i \in P$
                $u \le \mathbf{1}$.

The first set of constraints is to maintain competitiveness of his new price vector for
small $\varepsilon$, while the second set is an arbitrary bound imposed to keep this subproblem
bounded. It is easily shown that the solution $u_0$ to this problem is identical to the
solution of the associated dual (18). After determining the maximum $\varepsilon$ to maintain
feasibility, he announces his new prices.

At this point, rather than concede to the price adjustment, we recalculate the new 
minimum volume order based on the new prices. As the greedy (and shortsighted) 
entrepreneur continues to change his prices in an attempt to maximize profit he 
eventually finds he has reduced his business to zero! At that point we have, with 
his help, solved the original primal problem. 


Example. To illustrate the primal—dual method and indicate how it can be imple- 
mented through use of the tableau format consider the following problem: 


    minimize    $2x_1 + x_2 + 4x_3$
    subject to  $x_1 + x_2 + 2x_3 = 3$
                $2x_1 + x_2 + 3x_3 = 5$
                $x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0$.

Because all of the coefficients in the objective function are nonnegative, $\lambda = (0, 0)$
is a feasible vector for the dual. We lay out the simplex tableau shown below

                              $a_1$   $a_2$   $a_3$    ·      ·      $b$
                                1      1      2      1      0      3
                                2      1      3      0      1      5
                               -3     -2     -5      0      0     -8
    $c_i - \lambda^T a_i$ →           2      1      4      ·      ·      ·

    First tableau
To form this tableau we have adjoined artificial variables in the usual manner.
The third row gives the relative cost coefficients of the associated primal problem,
the same as the row that would be used in a phase I procedure. In the fourth row
are listed the $c_i - \lambda^T a_i$'s for the current $\lambda$. The allowable columns in the associated
restricted primal are determined by the zeros in this last row.

Since there are no zeros in the last row, no progress can be made in the
associated restricted primal and hence the original solution $x_1 = x_2 = x_3 = 0$, $y_1 = 3$,
$y_2 = 5$ is optimal for this $\lambda$. The solution $u_0$ to the associated restricted dual is
$u_0 = (1, 1)$, and the numbers $-u_0^T a_i$, $i = 1, 2, 3$ are equal to the first three elements
in the third row. Thus, we compute the three ratios $2/3$, $1/2$, $4/5$ from which we find
$\varepsilon_0 = 1/2$. The new values for the fourth row are now found by adding $\varepsilon_0$ times the
(first three) elements of the third row to the fourth row.


                              $a_1$   $a_2$   $a_3$    ·      ·      $b$
                                1     (1)     2      1      0      3
                                2      1      3      0      1      5
                               -3     -2     -5      0      0     -8
    $c_i - \lambda^T a_i$ →          1/2     0     3/2     ·      ·      ·

    Second tableau (pivot element in parentheses)


Minimizing the new associated restricted primal by pivoting as indicated we obtain

                              $a_1$   $a_2$   $a_3$    ·      ·      $b$
                                1      1      2      1      0      3
                                1      0      1     -1      1      2
                               -1      0     -1      2      0     -2
    $c_i - \lambda^T a_i$ →          1/2     0     3/2     ·      ·      ·

Now we again calculate the ratios $1/2$, $3/2$ obtaining $\varepsilon_0 = 1/2$, and add this multiple of
the third row to the fourth row to obtain the next tableau.

                              $a_1$   $a_2$   $a_3$    ·      ·      $b$
                                1      1      2      1      0      3
                               (1)     0      1     -1      1      2
                               -1      0     -1      2      0     -2
    $c_i - \lambda^T a_i$ →           0      0      1      ·      ·      ·

    Third tableau (pivot element in parentheses)


Optimizing the new restricted primal we obtain the tableau:

                              $a_1$   $a_2$   $a_3$    ·      ·      $b$
                                0      1      1      2     -1      1
                                1      0      1     -1      1      2
                                0      0      0      1      1      0
    $c_i - \lambda^T a_i$ →           0      0      1      ·      ·      ·

    Final tableau

Having obtained feasibility in the primal, we conclude that the solution is also
optimal: $x_1 = 2$, $x_2 = 1$, $x_3 = 0$.


*4.7 REDUCTION OF LINEAR INEQUALITIES


Linear programming is in part the study of linear inequalities, and each progressive 
stage of linear programming theory adds to our understanding of this important 
fundamental mathematical structure. Development of the simplex method, for 
example, provided by means of artificial variables a procedure for solving such 
systems. Duality theory provides additional insight and additional techniques for 
dealing with linear inequalities. 

Consider a system of linear inequalities in standard form

    $A x = b$
    $x \ge 0$,        (19)

where $A$ is an $m \times n$ matrix, $b$ is a constant nonzero $m$-vector, and $x$ is a variable
$n$-vector. Any point $x$ satisfying these conditions is called a solution. The set of
solutions is denoted by $S$.

It is the set S that is of primary interest in most problems involving systems 
of inequalities—the inequalities themselves acting merely to provide a description 
of S. Alternative systems having the same solution set S are, from this viewpoint, 
equivalent. In many cases, therefore, the system of linear inequalities originally used 
to define S may not be the simplest, and it may be possible to find another system 
having fewer inequalities or fewer variables while defining the same solution set S. 
It is this general issue that is explored in this section. 


Redundant Equations 


One way that a system of linear inequalities can sometimes be simplified is by the 
elimination of redundant equations. This leads to a new equivalent system having 
the same number of variables but fewer equations. 


Definition. Corresponding to the system of linear inequalities

    $A x = b$
    $x \ge 0$        (19)

we say the system has redundant equations if there is a nonzero $\lambda \in E^m$
satisfying

    $\lambda^T A = 0$
    $\lambda^T b = 0$.        (20)

This definition is equivalent, as the reader is aware, to the statement that a
system of equations is redundant if one of the equations can be expressed as a linear
combination of the others. In most of our previous analysis we have assumed, for
simplicity, that such redundant equations were not present in our given system or
that they were eliminated prior to further computation. Indeed, such redundancy
presents no real computational difficulty, since redundant equations are detected and
can be eliminated during application of the phase I procedure for determining a basic
feasible solution. Note, however, the hint of duality even in this elementary concept.


Null Variables 


Definition. Corresponding to the system of linear inequalities

    $A x = b$
    $x \ge 0$        (21)

a variable $x_i$ is said to be a null variable if $x_i = 0$ in every solution.


It is clear that if it were known that a variable $x_i$ were a null variable, then the
solution set $S$ could be equivalently described by the system of linear inequalities
obtained from (21) by deleting the $i$th column of $A$, deleting the inequality $x_i \ge 0$,
and adjoining the equality $x_i = 0$. This yields an obvious simplification in the
description of the solution set $S$. It is perhaps not so obvious how null variables
can be identified.


Example. As a simple example of how null variables may appear consider the
system

    $2x_1 + 3x_2 + 4x_3 + 4x_4 = 6$
    $x_1 + x_2 + 2x_3 + x_4 = 3$
    $x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0,\ x_4 \ge 0$.

By subtracting twice the second equation from the first we obtain

    $x_2 + 2x_4 = 0$.

Since the $x_i$'s must all be nonnegative, it follows immediately that $x_2$ and $x_4$ are
zero in any solution. Thus $x_2$ and $x_4$ are null variables.

Generalizing from the above example it is clear that if a linear combination of 
the equations can be found such that the right-hand side is zero while the coefficients 
on the left side are all either zero or positive, then the variables corresponding to 
the positive coefficients in this equation are null variables. In other words, if from 
the original system it is possible to combine equations so as to yield 


$$\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_n x_n = 0$$

with $\alpha_i \ge 0$, $i = 1, 2, \ldots, n$, then $\alpha_i > 0$ implies that $x_i$ is a null variable.
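
The observation above can be checked mechanically for the example system. The sketch below forms the linear combination of the equations with given multipliers and reports the variables whose coefficients are strictly positive when the combined right-hand side is zero and no coefficient is negative. The helper names `combine` and `null_variables` are illustrative assumptions.

```python
A = [[2, 3, 4, 4],
     [1, 1, 2, 1]]
b = [6, 3]

def combine(lam, A, b):
    """Coefficients and right-hand side of the combination sum_k lam[k] * (equation k)."""
    coeffs = [sum(l * row[j] for l, row in zip(lam, A)) for j in range(len(A[0]))]
    rhs = sum(l * bi for l, bi in zip(lam, b))
    return coeffs, rhs

def null_variables(lam, A, b):
    """1-based indices of variables certified null by lam, or [] if lam certifies nothing."""
    coeffs, rhs = combine(lam, A, b)
    if rhs != 0 or any(c < 0 for c in coeffs):
        return []
    return [j + 1 for j, c in enumerate(coeffs) if c > 0]

# First equation minus twice the second, as in the example.
certified = null_variables([1, -2], A, b)
```

With the multipliers (1, -2) the combined equation is $x_2 + 2x_4 = 0$, so the function reports variables 2 and 4.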
The above elementary observations clearly can be used to identify null variables
in some cases. A more surprising result is that the technique described above can 
be used to identify all null variables. The proof of this fact is based on the Duality 
Theorem. 


Null Value Theorem. If S is not empty, the variable $x_i$ is a null variable in
the system (21) if and only if there is a nonzero vector $\lambda \in E^m$ such that

$$\lambda^T A \ge 0, \quad \lambda^T b = 0, \tag{22}$$

and the ith component of $\lambda^T A$ is strictly positive.


Proof. The “if” part follows immediately from the discussion above. To prove the
“only if” part, suppose that $x_i$ is a null variable, and suppose that S is not empty.
Consider the program

$$\begin{aligned}
\text{minimize}\quad & -e_i x \\
\text{subject to}\quad & Ax = b \\
& x \ge 0,
\end{aligned}$$

where $e_i$ is the ith unit row vector. By our hypotheses, there is a feasible solution
and the optimal value is zero. By the Duality Theorem the dual program

$$\begin{aligned}
\text{maximize}\quad & \lambda^T b \\
\text{subject to}\quad & \lambda^T A \le -e_i
\end{aligned}$$

is also feasible and has optimal value zero. Thus there is a $\lambda$ with

$$\lambda^T A \le -e_i, \quad \lambda^T b = 0.$$

Changing the sign of $\lambda$ proves the theorem.


Nonextremal Variables 
Example 1. Consider the system of linear inequalities

$$\begin{aligned}
x_1 + 3x_2 + 4x_3 &= 4 \\
2x_1 + x_2 + 3x_3 &= 6 \\
x_1 \ge 0,\; x_2 \ge 0,\; x_3 &\ge 0.
\end{aligned} \tag{23}$$

By subtracting the second equation from the first and rearranging, we obtain

$$x_1 = 2 + 2x_2 + x_3. \tag{24}$$


From this we observe that since $x_2$ and $x_3$ are nonnegative, the value of $x_1$ is greater
than or equal to 2 in any solution to the equalities. This means that the inequality
$x_1 \ge 0$ can be dropped from the original set, and $x_1$ can be treated as a free variable
even though the remaining inequalities actually do not allow complete freedom.
Hence $x_1$ can be replaced everywhere by (24) in the original system (23) leading to

$$\begin{aligned}
5x_2 + 5x_3 &= 2 \\
x_2 \ge 0,\; x_3 &\ge 0 \\
x_1 &= 2 + 2x_2 + x_3.
\end{aligned} \tag{25}$$


The first two lines of (25) represent a system of linear inequalities in standard 
form with one less variable and one less equation than the original system. The last 
equation is a simple linear equation from which $x_1$ is determined by a solution to
the smaller system of inequalities. 
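
As a quick numerical check of this reduction, the sketch below takes a solution of the reduced system, recovers the eliminated variable from the rearranged equation, and confirms that the result satisfies the original two equations and all three nonnegativity constraints. The function names and the sample points are assumptions chosen for illustration; exact fractions avoid rounding issues.

```python
from fractions import Fraction

def recover(x2, x3):
    """Map a solution (x2, x3) of the reduced system back to (x1, x2, x3)."""
    if 5 * x2 + 5 * x3 != 2 or x2 < 0 or x3 < 0:
        raise ValueError("(x2, x3) does not solve the reduced system")
    x1 = 2 + 2 * x2 + x3          # the rearranged equation for x1
    return x1, x2, x3

def satisfies_original(x1, x2, x3):
    """Check the original equations and nonnegativity constraints."""
    return (x1 + 3 * x2 + 4 * x3 == 4
            and 2 * x1 + x2 + 3 * x3 == 6
            and min(x1, x2, x3) >= 0)

point_a = recover(Fraction(2, 5), Fraction(0))
point_b = recover(Fraction(1, 5), Fraction(1, 5))
```

Both sample points map to solutions of the original system, and in each case the recovered first variable is at least 2, as the argument above predicts.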

This example illustrates and motivates the concept of a nonextremal variable. 
As illustrated, the identification of such nonextremal variables results in a significant 
simplification of a system of linear inequalities. 


Definition. A variable $x_i$ in the system of linear inequalities

$$Ax = b, \quad x \ge 0 \tag{26}$$

is nonextremal if the inequality $x_i \ge 0$ in (26) is redundant.


A nonextremal variable can be treated as a free variable, and thus can be 
eliminated from the system by using one equation to define that variable in terms 
of the other variables. The result is a new system having one less variable and one 
less equation. Solutions to the original system can be obtained from solutions to the 
new system by substituting into the expression for the value of the free variable. 

It is clear that if, as in the example, a linear combination of the equations in
the system can be found that implies that $x_i$ is nonnegative if all other variables are
nonnegative, then $x_i$ is nonextremal. That the converse of this statement is also true
is perhaps not so obvious. Again the proof of this is based on the Duality Theorem.


Nonextremal Variable Theorem. If S is not empty, the variable $x_j$ is a
nonextremal variable for the system (26) if and only if there is $\lambda \in E^m$ and
$d \in E^n$ such that

$$\lambda^T A = d^T, \tag{27}$$

where

$$d_j = -1, \quad d_i \ge 0 \text{ for } i \ne j;$$

and such that

$$\lambda^T b = -\beta, \tag{28}$$

for some $\beta \ge 0$.


Proof. The “if” part of the result is trivial, since forming the corresponding linear
combination of the equations in (26) yields

$$x_j = \beta + d_1 x_1 + \cdots + d_{j-1} x_{j-1} + d_{j+1} x_{j+1} + \cdots + d_n x_n,$$

which implies that $x_j$ is nonextremal.

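To prove the “only if” part, let $a_i$, $i = 1, 2, \ldots, n$, denote the ith column of
A. Let us assume that the solution set S is nonempty and that $x_j$ is nonextremal.
Consider the linear program

$$\begin{aligned}
\text{minimize}\quad & x_j \\
\text{subject to}\quad & Ax = b \\
& x_i \ge 0, \quad i \ne j.
\end{aligned} \tag{29}$$

By hypothesis the minimum value is nonnegative, say it is $\beta \ge 0$. Then by the
Duality Theorem the value of the dual program

$$\begin{aligned}
\text{maximize}\quad & \lambda^T b \\
\text{subject to}\quad & \lambda^T a_i \le 0, \quad i \ne j \\
& \lambda^T a_j = 1
\end{aligned}$$

is also $\beta$. Taking the negative of the optimal solution to the dual yields the desired
result.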
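
The conditions of the theorem are easy to verify for a proposed multiplier vector. The sketch below checks them for the system of Example 1, using the multipliers (1, -1) that correspond to subtracting the second equation from the first. The function name `nonextremal_certificate` is an assumption for illustration; finding such multipliers in general requires solving a linear program, which this sketch does not attempt.

```python
def nonextremal_certificate(lam, A, b, j):
    """Check lam^T A = d^T with d_j = -1, d_i >= 0 for i != j, and lam^T b = -beta, beta >= 0.

    j is a 1-based variable index. Returns beta when lam is a valid certificate,
    otherwise None.
    """
    n = len(A[0])
    d = [sum(l * row[k] for l, row in zip(lam, A)) for k in range(n)]
    beta = -sum(l * bi for l, bi in zip(lam, b))
    valid = (d[j - 1] == -1
             and all(d[k] >= 0 for k in range(n) if k != j - 1)
             and beta >= 0)
    return beta if valid else None

# System of Example 1.
A = [[1, 3, 4],
     [2, 1, 3]]
b = [4, 6]
beta = nonextremal_certificate([1, -1], A, b, j=1)
```

Here the combination gives d = (-1, 2, 1) and a right-hand side of -2, so the certificate is valid with lower bound 2 on the first variable, in agreement with Example 1.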


Nonextremal variables occur frequently in systems of linear inequalities. It can 
be shown, for instance, that every system having three nonnegative variables and 
two (independent) equations can be reduced to two non-negative variables and one 
equation. 


Applications 


Each of the reduction concepts can be applied by searching for a $\lambda$ satisfying an
appropriate system of linear inequalities. This can be done by application of the 
simplex method. Thus, the theorems above translate into systematic procedures for 
reducing a system. 

The reduction methods described in this section can be applied to any linear 
program in an effort to simplify the representation of the feasible region. Of course, 
for the purpose of simply solving a given linear program the reduction process is 
not particularly worthwhile. However, when considering a large problem that will 
be solved many times with different objective functions, or a problem with linear 
constraints but a nonlinear objective, the reduction procedure can be valuable. 
Fig. 4.4 Redundant inequality


One interesting area of application is the elimination of redundant inequality 
constraints. Consider the region shown in Fig. 4.4 defined by the nonnegativity 
constraint and three other linear inequalities. The system can be expressed as 


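$$a_1^T x \le b_1, \quad a_2^T x \le b_2, \quad a_3^T x \le b_3, \quad x \ge 0, \tag{30}$$

which in standard form is

$$a_1^T x + y_1 = b_1, \quad a_2^T x + y_2 = b_2, \quad a_3^T x + y_3 = b_3, \quad x \ge 0, \quad y \ge 0. \tag{31}$$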


The third constraint is, as seen from the figure, redundant and can be eliminated 
without changing the solution set. In the standard form (31) this is reflected in 
the fact that $y_3$ is nonextremal and hence it, together with the third constraint, can
be eliminated. This special example generalizes, of course, to higher dimensional 
problems involving many inequalities where, in general, redundant inequalities 
show up as having nonextremal slack variables. The detection and elimination of 
such redundant inequalities can be helpful in the cutting-plane methods (discussed 
in Chapter 14) where inequalities are continually appended to a system as the 
method progresses. 


4.8 EXERCISES 


1. Verify in detail that the dual of the dual of a linear program is the original problem.


2. Show that if a linear inequality in a linear program is changed to equality, the corre- 
sponding dual variable becomes free. 
3. Find the dual of

$$\begin{aligned}
\text{minimize}\quad & c^T x \\
\text{subject to}\quad & Ax = b \\
& x \ge a
\end{aligned}$$

where $a \ge 0$.


4. Show that in the transportation problem the linear equality constraints are not linearly 
independent, and that in an optimal solution to the dual problem the dual variables are 
not unique. Generalize this observation to any linear program having redundant equality 
constraints. 


5. Construct an example of a primal problem that has no feasible solutions and whose 
corresponding dual also has no feasible solutions. 


6. Let A be an m × n matrix and c be an n-vector. Prove that $Ax \le 0$ implies $c^T x \le 0$ if
and only if $c^T = \lambda^T A$ for some $\lambda \ge 0$. Give a geometric interpretation of the result.


7. There is in general a strong connection between the theories of optimization and free 
competition, which is illustrated by an idealized model of activity location. Suppose 
there are n economic activities (various factories, homes, stores, etc.) that are to be 
individually located on n distinct parcels of land. If activity i is located on parcel j that
activity can yield $s_{ij}$ units (dollars) of value.

If the assignment of activities to land parcels is made by a central authority, it might
be made in such a way as to maximize the total value generated. In other words, the
assignment would be made so as to maximize $\sum_i \sum_j s_{ij} x_{ij}$ where

$$x_{ij} = \begin{cases} 1 & \text{if activity } i \text{ is assigned to parcel } j \\ 0 & \text{otherwise.} \end{cases}$$

More explicitly this approach leads to the optimization problem

$$\begin{aligned}
\text{maximize}\quad & \sum_i \sum_j s_{ij} x_{ij} \\
\text{subject to}\quad & \sum_j x_{ij} = 1, \quad i = 1, 2, \ldots, n \\
& \sum_i x_{ij} = 1, \quad j = 1, 2, \ldots, n \\
& x_{ij} \ge 0, \quad x_{ij} = 0 \text{ or } 1.
\end{aligned}$$


Actually, it can be shown that the final requirement ($x_{ij}$ = 0 or 1) is automatically
satisfied at any extreme point of the set defined by the other constraints, so that in fact the 
optimal assignment can be found by using the simplex method of linear programming. 

If one considers the problem from the viewpoint of free competition, it is assumed 
that, rather than a central authority determining the assignment, the individual activities 
bid for the land and thereby establish prices. 


a) Show that there exists a set of activity prices $p_i$, $i = 1, 2, \ldots, n$ and land prices
$q_j$, $j = 1, 2, \ldots, n$ such that

$$p_i + q_j \ge s_{ij}, \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, n$$

with equality holding if in an optimal assignment activity i is assigned to parcel j.

b) Show that Part (a) implies that if activity i is optimally assigned to parcel j and if j′
is any other parcel

$$s_{ij} - q_j \ge s_{ij'} - q_{j'}.$$

Give an economic interpretation of this result and explain the relation between free
competition and optimality in this context.

c) Assuming that each $s_{ij}$ is positive, show that the prices can all be assumed to be
nonnegative.


8. Game theory is in part related to linear programming theory. Consider the game in
which player X may select any one of m moves, and player Y may select any one
of n moves. If X selects i and Y selects j, then X wins an amount $a_{ij}$ from Y. The
game is repeated many times. Player X develops a mixed strategy where the various
moves are played according to probabilities represented by the components of the vector
$x = (x_1, x_2, \ldots, x_m)$, where $x_i \ge 0$, $i = 1, 2, \ldots, m$ and $\sum_{i=1}^m x_i = 1$. Likewise Y develops
a mixed strategy $y = (y_1, y_2, \ldots, y_n)$, where $y_j \ge 0$, $j = 1, 2, \ldots, n$ and $\sum_{j=1}^n y_j = 1$. The
average payoff to X is then $P(x, y) = x^T A y$.


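a) Suppose X selects x as the solution to the linear program

$$\begin{aligned}
\text{maximize}\quad & A \\
\text{subject to}\quad & \sum_{i=1}^m x_i = 1 \\
& \sum_{i=1}^m x_i a_{ij} \ge A, \quad j = 1, 2, \ldots, n \\
& x_i \ge 0, \quad i = 1, 2, \ldots, m.
\end{aligned}$$

Show that X is guaranteed a payoff of at least A no matter what y is chosen by Y.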


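b) Show that the dual of the problem above is

$$\begin{aligned}
\text{minimize}\quad & B \\
\text{subject to}\quad & \sum_{j=1}^n y_j = 1 \\
& \sum_{j=1}^n a_{ij} y_j \le B, \quad i = 1, 2, \ldots, m \\
& y_j \ge 0, \quad j = 1, 2, \ldots, n.
\end{aligned}$$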


c) Prove that max A = min B. (The common value is called the value of the game.) 

d) Consider the “matching” game. Each player selects heads or tails. If the choices 
match, X wins $1 from Y; if they do not match, Y wins $1 from X. Find the value 
of this game and the optimal mixed strategies. 

e) Repeat Part (d) for the game where each player selects either 1, 2, or 3. The player 
with the highest number wins $1 unless that number is exactly 1 higher than the 
other player’s number, in which case he loses $3. When the numbers are equal there 
is no payoff. 
9. Consider the primal linear program

$$\begin{aligned}
\text{minimize}\quad & c^T x \\
\text{subject to}\quad & Ax = b \\
& x \ge 0.
\end{aligned}$$


Suppose that this program and its dual are feasible. Let $\lambda$ be a known optimal solution
to the dual.


a) If the kth equation of the primal is multiplied by $\mu \ne 0$, determine an optimal solution
w to the dual of this new problem.

b) Suppose that, in the original primal, we add $\mu$ times the kth equation to the rth
equation. What is an optimal solution w to the corresponding dual problem?

c) Suppose, in the original primal, we add $\mu$ times the kth row of A to c. What is an
optimal solution to the corresponding dual problem?


10. Consider the linear program (P) of the form

$$\begin{aligned}
\text{minimize}\quad & q^T z \\
\text{subject to}\quad & Mz \ge -q \\
& z \ge 0
\end{aligned}$$

in which the matrix M is skew symmetric; that is, $M = -M^T$.


(a) Show that problem (P) and its dual are the same.
(b) A problem of the kind in part (a) is said to be self-dual. An example of a self-dual
problem has

$$M = \begin{bmatrix} 0 & -A^T \\ A & 0 \end{bmatrix}, \quad q = \begin{bmatrix} c \\ -b \end{bmatrix}, \quad z = \begin{bmatrix} x \\ y \end{bmatrix}.$$

Give an interpretation of the problem with this data.
(c) Show that a self-dual linear program has an optimal solution if and only if it is
feasible.


11. A company may manufacture n different products, each of which uses various amounts
of m limited resources. Each unit of product i yields a profit of $c_i$ dollars and uses $a_{ji}$
units of the jth resource. The available amount of the jth resource is $b_j$. To maximize
profit the company selects the quantities $x_i$ to be manufactured of each product by
solving


maximize c’x 
subject to Ax=b 
x>0. 


The unit profits c; already take into account the variable cost associated with manufac- 
turing each unit. In addition to that cost, the company incurs a fixed overhead H, and 
for accounting purposes it wants to allocate this overhead to each of its products. In 
other words, it wants to adjust the unit profits so as to account for the overhead. Such an 
overhead allocation scheme must satisfy two conditions: (1) Since H is fixed regardless 
of the product mix, the overhead allocation scheme must not alter the optimal solution, 
(2) All the overhead must be allocated; that is, the optimal value of the objective with 
the modified cost coefficients must be H dollars lower than z—the original optimal 
value of the objective. 
a) Consider the allocation scheme in which the unit profits are modified according to
$\bar{c}^T = c^T - r\lambda_0^T A$, where $\lambda_0$ is the optimal solution to the original dual and $r = H/z_0$
(assume $H \le z_0$).


i) Show that the optimal x for the modified problem is the same as that for the
original problem, and the new dual solution is $\lambda^* = (1 - r)\lambda_0$.
ii) Show that this approach fully allocates H.


b) Suppose that the overhead can be traced to each of the resource constraints. Let
$H_i \ge 0$ be the amount of overhead associated with the ith resource, where $\sum_{i=1}^m H_i \le z_0$
and $r_i = H_i/b_i \le (\lambda_0)_i$ for $i = 1, \ldots, m$. Based on this information, an allocation scheme
has been proposed where the unit profits are modified such that $\bar{c}^T = c^T - r^T A$.


i) Show that the optimal x for this modified problem is the same as that for the
original problem, and the corresponding dual solution is $\lambda^* = \lambda_0 - r$.
ii) Show that this scheme fully allocates H.


12. Solve the linear inequalities 


—2x,+ 2x,<-1 
2xX,—- XxX, <2 
— 4x, <3 
—15x, — 12x, < —2 
12x, + 20x, < —1. 


Note that $x_1$ and $x_2$ are not restricted to be positive. Solve this problem by considering
the problem of maximizing $0 \cdot x_1 + 0 \cdot x_2$ subject to these constraints, taking the dual and
using the simplex method.


13. a) Using the simplex method solve 


minimize 2X1 — Xp 
subject to 2X; —X_y—X3 23 
X)—%X%, +x, 22 


(Hint: Note that x, =2 gives a feasible solution.) 
b) What is the dual problem and its optimal solution? 


14. a) Using the simplex method solve 


minimize 2x, +3x,+2x,+2x, 
subject to =x, +2x,+ 2x34+2x,=3 

X,+ X)+2x3+4x,=5 
x; > 0, i=1,2,3,4. 


i 


b) Using the work done in Part (a) and the dual simplex method, solve the same problem 
but with the right-hand sides of the equations changed to 8 and 7 respectively. 
15. For the problem


minimize 5x,+3x, 


x, 20, » 20, 20; 


a) Using a single pivot operation with pivot element 1, find a feasible solution. 
b) Using the simplex method, solve the problem. 

c) What is the dual problem? 

d) What is the solution to the dual? 


16. Solve the following problem by the dual simplex method:


minimize —7x, + 7x, — 2x; — X4 — 6x5 


subject to 3x; — X)+ x3 — 2x4 =—3 
2X, + Xp + xt Xs = 4 
X, + 3X5, — 3%, +x, = 12 


and $x_i \ge 0$, $i = 1, \ldots, 6$.


17. Given the linear programming problem in standard form (3) suppose a basis B and the
corresponding (not necessarily feasible) primal and dual basic solutions x and $\lambda$ are
known. Assume that at least one relative cost coefficient $c_i - \lambda^T a_i$ is negative. Consider
the auxiliary problem


minimize c’x 


subject to Ax=b 


where $T = \{i : c_i - \lambda^T a_i < 0\}$, y is a slack variable, and M is a large positive constant.
Show that if k is the index corresponding to the most negative relative cost coefficient
in the original solution, then $(\lambda, c_k - \lambda^T a_k)$ is dual feasible for the auxiliary problem.
Based on this observation, develop a big-M artificial constraint method for the dual
simplex method. (Refer to Exercise 24, Chapter 3.)


18. A textile firm is capable of producing three products: $x_1$, $x_2$, $x_3$. Its production plan for
next month must satisfy the constraints

$$\begin{aligned}
x_1 + 2x_2 + 2x_3 &\le 12 \\
2x_1 + 4x_2 + x_3 &\le f \\
x_1 \ge 0,\; x_2 \ge 0,\; x_3 &\ge 0.
\end{aligned}$$

The first constraint is determined by equipment availability and is fixed. The second
constraint is determined by the availability of cotton. The net profits of the products are
2, 3, and 3, respectively, exclusive of the cost of cotton and fixed costs.


a) Find the shadow price $\lambda_2$ of the cotton input as a function of f. (Hint: Use the dual
simplex method.) Plot $\lambda_2(f)$ and the net profit z(f) exclusive of the cost for cotton.

b) The firm may purchase cotton on the open market at a price of 1/6. However, it may 
acquire a limited amount at a price of 1/12 from a major supplier that it purchases 
from frequently. Determine the net profit of the firm $\pi(s)$ as a function of s.


19. Consider the problem


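$$\begin{aligned}
\text{minimize}\quad & 2x_1 + x_2 + 4x_3 \\
\text{subject to}\quad & x_1 + x_2 + 2x_3 = 3 \\
& 2x_1 + x_2 + 3x_3 = 5 \\
& x_1 \ge 0,\; x_2 \ge 0,\; x_3 \ge 0.
\end{aligned}$$

a) What is the dual problem?
b) Note that $\lambda = (1, 0)$ is feasible for the dual. Starting with this $\lambda$, solve the primal
using the primal-dual algorithm.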


20. Show that in the associated restricted dual of the primal-dual method the objective $\lambda^T b$
can be replaced by $\lambda^T y$.


21. Given the system of linear inequalities (19), what is implied by the existence of a $\lambda$
satisfying $\lambda^T A = 0$, $\lambda^T b \ne 0$?


22. Suppose a system of linear inequalities possesses null variables. Show that when the
null variables are eliminated, by setting them identically to zero, the resulting system
will have redundant equations. Verify this for the example in Section 4.7.


23. Prove that any system of linear inequalities in standard form having two equations and
three variables can be reduced.


24. Show that if a system of linear inequalities in standard form has a nondegenerate basic
feasible solution, the corresponding nonbasic variables are extremal.


25. Eliminate the null variables in the system


2X)+ X_—X,+ Xyt xX5=2 
—X,+2x,4+2%3+2x%4,+ x5 = -1 


-X,— Xy —3x,+2x5=—-1 


26. Reduce to minimal size


Xy+ X,.+2x3+ xy+ x5 =6 
3x, + x3 +5x,+4x,=4 
X,+ Xy— xX34+2x44+2x, = 3 
Chapter 5 INTERIOR-POINT METHODS


Linear programs can be viewed in two somewhat complementary ways. They are, 
in one view, a class of continuous optimization problems each with continuous 
variables defined on a convex feasible region and with a continuous objective 
function. They are, therefore, a special case of the general form of problem 
considered in this text. However, linearity implies a certain degree of degeneracy, 
since for example the derivatives of all functions are constants and hence the differ- 
ential methods of general optimization theory cannot be directly used. From an 
alternative view, linear programs can be considered as a class of combinatorial 
problems because it is known that solutions can be found by restricting attention 
to the vertices of the convex polyhedron defined by the constraints. Indeed, this 
view is natural when considering network problems such as those of Chapter 6. 
However, the number of vertices may be large, up to n!/(m!(n−m)!), making direct
search impossible for even modest size problems. 

The simplex method embodies both of these viewpoints, for it restricts attention 
to vertices, but exploits the continuous nature of the variables to govern the progress 
from one vertex to another, defining a sequence of adjacent vertices with improving 
values of the objective as the process reaches an optimal point. The simplex method, 
with ever-evolving improvements, has for five decades provided an efficient general 
method for solving linear programs. 

Although it performs well in practice, visiting only a small fraction of the total 
number of vertices, a definitive theory of the simplex method’s performance was 
unavailable. However, in 1972, Klee and Minty showed by examples that for certain 
linear programs the simplex method will examine every vertex. These examples
proved that in the worst case, the simplex method requires a number of steps that 
is exponential in the size of the problem. 

In view of this result, many researchers believed that a good algorithm, different 
than the simplex method, might be devised whose number of steps would be 
polynomial rather than exponential in the program’s size—that is, the time required 
to compute the solution would be bounded above by a polynomial in the size of 
the problem.¹


¹We will be more precise about complexity notions such as “polynomial algorithm” in
Section 5.1 below.

Indeed, in 1979, a new approach to linear programming, Khachiyan’s ellipsoid
method, was announced with great acclaim. The method is quite different in structure
than the simplex method, for it constructs a sequence of shrinking ellipsoids each of 
which contains the optimal solution set and each member of the sequence is smaller 
in volume than its predecessor by at least a certain fixed factor. Therefore, the 
solution set can be found to any desired degree of approximation by continuing the 
process. Khachiyan proved that the ellipsoid method, developed during the 1970s 
by other mathematicians, is a polynomial-time algorithm for linear programming. 

Practical experience, however, was disappointing. In almost all cases, the 
simplex method was much faster than the ellipsoid method. However, Khachiyan’s 
ellipsoid method showed that polynomial time algorithms for linear programming 
do exist. It left open the question of whether one could be found that, in practice, 
was faster than the simplex method. 

It is then perhaps not surprising that the announcement by Karmarkar in 1984 
of a new polynomial time algorithm, an interior-point method, with the potential to 
improve the practical effectiveness of the simplex method made front-page news 
in major newspapers and magazines throughout the world. It is this interior-point 
approach that is the subject of this chapter and the next. 

This chapter begins with a brief introduction to complexity theory, which is the 
basis for a way to quantify the performance of iterative algorithms, distinguishing 
polynomial-time algorithms from others. 

Next the example of Klee and Minty showing that the simplex method is not 
a polynomial-time algorithm in the worst case is presented. Following that the 
ellipsoid algorithm is defined and shown to be a polynomial-time algorithm. These 
two sections provide a deeper understanding of how the modern theory of linear 
programming evolved, and help make clear how complexity theory impacts linear 
programming. However, the reader may wish to consider them optional and omit 
them at first reading. 

The development of the basics of interior-point theory begins with Section 5.4 
which introduces the concept of barrier functions and the analytic center. Section 5.5 
introduces the central path which underlies interior-point algorithms. The relations 
between primal and dual in this context are examined. An overview of the details
of specific interior-point algorithms based on the theory is presented in Sections
5.6 and 5.7.


5.1 ELEMENTS OF COMPLEXITY THEORY 


Complexity theory is arguably the foundation for analysis of computer algorithms. 
The goal of the theory is twofold: to develop criteria for measuring the effectiveness 
of various algorithms (and thus, be able to compare algorithms using these criteria), 
and to assess the inherent difficulty of various problems. 

The term complexity refers to the amount of resources required by a compu- 
tation. In this chapter we focus on a particular resource, namely, computing time. 
In complexity theory, however, one is not interested in the execution time of a 
program implemented in a particular programming language, running on a particular 
computer over a particular input. This involves too many contingent factors. Instead,
one wishes to associate to an algorithm more intrinsic measures of its time require- 
ments. 

Roughly speaking, to do so one needs to define: 

• a notion of input size,

• a set of basic operations, and

• a cost for each basic operation.

The last two allow one to associate a cost with a computation. If x is any input, the
cost C(x) of the computation with input x is the sum of the costs of all the basic 
operations performed during this computation. 

Let A be an algorithm and $I_n$ be the set of all its inputs having size n. The
worst-case cost function of A is the function $T_A^w$ defined by

$$T_A^w(n) = \sup_{x \in I_n} C(x).$$

If there is a probability structure on $I_n$ it is possible to define the average-case cost
function $T_A^a$ given by

$$T_A^a(n) = E_n(C(x)),$$

where $E_n$ is the expectation over $I_n$. However, the average is usually more difficult
to find, and there is of course the issue of what probabilities to assign.

We now discuss how the objects in the three items above are selected. The 
selection of a set of basic operations is generally easy. For the algorithms we 
consider in this chapter, the obvious choice is the set {+, —, x, /, <} of the four 
arithmetic operations and the comparison. Selecting a notion of input size and a cost 
for the basic operations depends on the kind of data dealt with by the algorithm. 
Some kinds can be represented within a fixed amount of computer memory; others 
require a variable amount. 

Examples of the first are fixed-precision floating-point numbers, stored in a
fixed amount of memory (usually 32 or 64 bits). For this kind of data the size of
an element is usually taken to be 1, so that each number has unit size.

Examples of the second are integer numbers which require a number of bits 
approximately equal to the logarithm of their absolute value. This (base 2) logarithm 
is usually referred to as the bit size of the integer. Similar ideas apply for rational 
numbers. 

Let A be some kind of data and $x = (x_1, \ldots, x_n) \in A^n$. If A is of the first kind
above then we define size(x) = n. Otherwise, we define $\text{size}(x) = \sum_{i=1}^n \text{bit-size}(x_i)$.

The cost of operating on two unit-size numbers is taken to be 1 and is called
the unit cost. In the bit-size case, the cost of operating on two numbers is the 
product of their bit-sizes (for multiplications and divisions) or their maximum (for 
additions, subtractions, and comparisons). 
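
The two size and cost conventions can be made concrete with a small sketch. The helpers below are illustrative assumptions, not definitions from the text: `bit_size` uses the exact binary length of the absolute value (which agrees with the base-2 logarithm up to rounding), ignores the sign, and assigns size 1 to zero.

```python
def bit_size(k):
    """Number of bits in |k|; zero is given size 1 by convention (an assumption)."""
    return max(1, abs(k).bit_length())

def size(x, unit=False):
    """Input size of a vector: its length under unit size, else the sum of bit sizes."""
    return len(x) if unit else sum(bit_size(k) for k in x)

def op_cost(op, a, b, unit=False):
    """Cost of one basic operation on a and b under the unit-cost or bit-cost convention."""
    if unit:
        return 1
    sa, sb = bit_size(a), bit_size(b)
    # Product of bit sizes for * and /, maximum for +, -, and comparisons.
    return sa * sb if op in ("*", "/") else max(sa, sb)

x = [5, 1024]
```

For the vector (5, 1024) the unit-size convention gives size 2, while the bit-size convention gives 3 + 11 = 14, and multiplying the two entries costs 33 under bit cost but 1 under unit cost.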

The consideration of integer or rational data with their associated bit size and 
bit cost for the arithmetic operations is usually referred to as the Turing model of 
computation. The consideration of idealized reals with unit size and unit cost is 
today referred to as the real number arithmetic model. When comparing algorithms,
one should make clear which model of computation is used to derive complexity 
bounds. 

A basic concept related to both models of computation is that of polynomial 
time. An algorithm A is said to be a polynomial time algorithm if $T_A^w(n)$ is bounded
above by a polynomial. A problem can be solved in polynomial time if there is a
polynomial time algorithm solving the problem. The notion of average polynomial
time is defined similarly, replacing $T_A^w$ by $T_A^a$.

The notion of polynomial time is usually taken as the formalization of efficiency 
in complexity theory. 


*5.2 THE SIMPLEX METHOD IS NOT
POLYNOMIAL-TIME


When the simplex method is used to solve a linear program in standard form with
coefficient matrix $A \in E^{m \times n}$, $b \in E^m$ and $c \in E^n$, the number of pivot steps to solve
the problem starting from a basic feasible solution is typically a small multiple of
m: usually between 2m and 3m. In fact, Dantzig observed that for problems with
m < 50 and n < 200 the number of iterations is ordinarily less than 1.5m.

At one time researchers believed—and attempted to prove—that the simplex 
algorithm (or some variant thereof) always requires a number of iterations that is 
bounded by a polynomial expression in the problem size. That was until Victor Klee 
and George Minty exhibited a class of linear programs each of which requires an 
exponential number of iterations when solved by the conventional simplex method. 

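One form of the Klee-Minty example is

$$\begin{aligned}
\text{maximize}\quad & \sum_{j=1}^{n} 10^{n-j} x_j \\
\text{subject to}\quad & 2\sum_{j=1}^{i-1} 10^{i-j} x_j + x_i \le 100^{i-1}, \quad i = 1, \ldots, n \\
& x_j \ge 0, \quad j = 1, \ldots, n.
\end{aligned} \tag{1}$$

The problem above is easily cast as a linear program in standard form.
A specific case is that for n = 3, giving

$$\begin{aligned}
\text{maximize}\quad & 100x_1 + 10x_2 + x_3 \\
\text{subject to}\quad & x_1 \le 1 \\
& 20x_1 + x_2 \le 100 \\
& 200x_1 + 20x_2 + x_3 \le 10{,}000 \\
& x_1 \ge 0,\; x_2 \ge 0,\; x_3 \ge 0.
\end{aligned}$$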


In this case, we have three constraints and three variables (along with their 
nonnegativity constraints). After adding slack variables, the problem is in standard 
form. The system has m = 3 equations and n = 6 nonnegative variables. It can be
verified that it takes $2^3 - 1 = 7$ pivot steps to solve the problem with the simplex
method when at each step the pivot column is chosen to be the one with the largest
(because this is a maximization problem) reduced cost. (See Exercise 1.)
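
The data of the Klee-Minty problem for any n can be generated directly from the pattern visible in the n = 3 case: objective coefficients are decreasing powers of 10, the ith constraint has coefficient 1 on its own variable and twice a power of 10 on each earlier variable, and the right-hand sides are powers of 100. The function name `klee_minty` and the (c, A, b) return convention are assumptions of this sketch; it only builds the data and does not run the simplex method.

```python
def klee_minty(n):
    """Return (c, A, b) for: maximize c^T x subject to A x <= b, x >= 0."""
    c = [10 ** (n - j) for j in range(1, n + 1)]
    A = [[2 * 10 ** (i - j) if j < i else (1 if j == i else 0)
          for j in range(1, n + 1)]
         for i in range(1, n + 1)]
    b = [100 ** (i - 1) for i in range(1, n + 1)]
    return c, A, b

c, A, b = klee_minty(3)
pivot_steps = 2 ** 3 - 1   # pivot count stated in the text for the largest-reduced-cost rule
```

For n = 3 this reproduces the coefficients of the specific case written out above.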

The general problem of the class (1) takes $2^n - 1$ pivot steps and this is in fact
the number of vertices minus one (which is the starting vertex). To get an idea of
how bad this can be, consider the case where n = 50. We have $2^{50} - 1 \approx 10^{15}$. In
a year with 365 days, there are approximately $3 \times 10^7$ seconds. If a computer ran
continuously, performing a million pivots of the simplex algorithm per second, it
would take approximately

$$\frac{10^{15}}{3 \times 10^7 \times 10^6} \approx 33 \text{ years}$$


to solve a problem of this class using the greedy pivot selection rule. 
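
The estimate can be reproduced with exact counts instead of the rounded powers of ten. The sketch below assumes the same rate of one million pivots per second and a 365-day year; the function name `years_to_solve` is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600      # 31,536,000, roughly 3 x 10^7

def years_to_solve(n, pivots_per_second=10**6):
    """Years needed to perform 2^n - 1 pivots at the given rate."""
    pivots = 2 ** n - 1
    return pivots / pivots_per_second / SECONDS_PER_YEAR

rounded_estimate = 10**15 / (3 * 10**7 * 10**6)   # the rounded figures used in the text
exact_estimate = years_to_solve(50)
```

The rounded figures give about 33 years, while the unrounded count comes to roughly 36 years; either way the running time is measured in decades, and each additional variable doubles it.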


*5.3 THE ELLIPSOID METHOD


The basic ideas of the ellipsoid method stem from research done in the 1960s and 
1970s mainly in the Soviet Union (as it was then called) by others who preceded 
Khachiyan. In essence, the idea is to enclose the region of interest in ever smaller 
ellipsoids. 

The significant contribution of Khachiyan was to demonstrate that under
certain assumptions, the ellipsoid method constitutes a polynomially bounded 
algorithm for linear programming. 

The version of the method discussed here is really aimed at finding a point of 
a polyhedral set OQ, given by a system of linear inequalities. 

QO = {y € E” ry a Se; j=l,...n} 
Finding a point of 0 can be thought of as equivalent to solving a linear programming 
problem. 

Two important assumptions are made regarding this problem: 


(A1) There is a vector $y_0 \in E^m$ and a scalar $R > 0$ such that the closed ball $S(y_0, R)$
with center $y_0$ and radius $R$, that is

$$\{y \in E^m : |y - y_0| \le R\},$$

contains $\Omega$.

(A2) If $\Omega$ is nonempty, there is a known scalar $r > 0$ such that $\Omega$ contains a ball
of the form $S(y^*, r)$ with center at $y^*$ and radius $r$. (This assumption implies
that if $\Omega$ is nonempty, then it has a nonempty interior and its volume is at
least $\mathrm{vol}(S(0, r))$.)³


The (topological) interior of any set $\Omega$ is the set of points in $\Omega$ which are the centers of
some balls contained in $\Omega$.

Definition. An ellipsoid in $E^m$ is a set of the form

$$E = \{y \in E^m : (y - z)^T Q (y - z) \le 1\}$$

where $z \in E^m$ is a given point (called the center) and $Q$ is a positive definite
matrix (see Section A.4 of Appendix A) of dimension $m \times m$. This ellipsoid is
denoted $\mathrm{ell}(z, Q)$.


The unit sphere S(0, 1) centered at the origin 0 is a special ellipsoid with Q = I, 
the identity matrix. 

The axes of a general ellipsoid are the eigenvectors of $Q$ and the lengths of the
axes are $\lambda_1^{-1/2}, \lambda_2^{-1/2}, \ldots, \lambda_m^{-1/2}$, where the $\lambda_i$'s are the corresponding eigenvalues.


It can be shown that the volume of an ellipsoid is

$$\mathrm{vol}(E) = \mathrm{vol}(S(0, 1)) \prod_{i=1}^{m} \lambda_i^{-1/2} = \mathrm{vol}(S(0, 1)) \det(Q^{-1/2}).$$
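As an illustration of the relation between the eigenvalues of $Q$, the axis lengths, and the volume, the sketch below computes them with NumPy for a small example. The function name and the sample matrix are assumptions made for this illustration only.

```python
import numpy as np

def ellipsoid_axes_and_volume_factor(Q):
    """Return axis lengths, axis directions, and vol(E) / vol(S(0, 1)) for ell(z, Q)."""
    eigenvalues, eigenvectors = np.linalg.eigh(Q)   # Q is symmetric positive definite
    axis_lengths = eigenvalues ** -0.5              # lambda_i^(-1/2)
    volume_factor = float(np.prod(axis_lengths))    # product of the axis lengths
    return axis_lengths, eigenvectors, volume_factor

# Sample ellipsoid whose axes have lengths 1/2 and 2 along the coordinate directions.
Q = np.array([[4.0, 0.0], [0.0, 0.25]])
lengths, directions, factor = ellipsoid_axes_and_volume_factor(Q)
print(lengths, factor)
```

The volume factor returned here agrees with $\det(Q^{-1/2}) = \det(Q)^{-1/2}$, which is the second form of the formula above.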


Cutting Plane and New Containing Ellipsoid 


In the ellipsoid method, a series of ellipsoids $E_k$ is defined, with centers $y_k$ and
with the defining $Q = B_k^{-1}$, where $B_k$ is symmetric and positive definite.

At each iteration of the algorithm, we have $\Omega \subset E_k$. It is then possible to check
whether $y_k \in \Omega$. If so, we have found an element of $\Omega$ as required. If not, there is
at least one constraint that is violated. Suppose $a_j^T y_k > c_j$. Then

$$\Omega \subset \tfrac{1}{2}E_k := \{y \in E_k : a_j^T y \le a_j^T y_k\}.$$

This set is half of the ellipsoid, obtained by cutting the ellipsoid in half through its
center.

The successor ellipsoid $E_{k+1}$ is defined to be the minimal-volume ellipsoid
containing $\tfrac{1}{2}E_k$. It is constructed as follows. Define

$$\tau = \frac{1}{m+1}, \qquad \delta = \frac{m^2}{m^2 - 1}, \qquad \sigma = 2\tau.$$

Fig. 5.1 A half-ellipsoid

Then put

$$y_{k+1} = y_k - \frac{\tau}{(a_j^T B_k a_j)^{1/2}} B_k a_j, \qquad B_{k+1} = \delta \left( B_k - \sigma \frac{B_k a_j a_j^T B_k}{a_j^T B_k a_j} \right). \qquad (2)$$


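Theorem 1. The ellipsoid $E_{k+1} = \mathrm{ell}(y_{k+1}, B_{k+1}^{-1})$ defined as above is the
ellipsoid of least volume containing $\tfrac{1}{2}E_k$. Moreover,

$$\frac{\mathrm{vol}(E_{k+1})}{\mathrm{vol}(E_k)} = \left(\frac{m^2}{m^2-1}\right)^{(m-1)/2} \frac{m}{m+1} < \exp\left(-\frac{1}{2(m+1)}\right) < 1.$$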


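Proof. We shall not prove the statement about the new ellipsoid being of least
volume, since that is not necessary for the results that follow. To prove the remainder
of the statement, we have

$$\frac{\mathrm{vol}(E_{k+1})}{\mathrm{vol}(E_k)} = \frac{\det(B_{k+1}^{1/2})}{\det(B_k^{1/2})}.$$

For simplicity, by a change of coordinates, we may take $B_k = I$. Then $B_{k+1}$ has
$m - 1$ eigenvalues equal to $\delta = \frac{m^2}{m^2-1}$, and one eigenvalue equal to
$\delta - 2\delta\tau = \frac{m^2}{m^2-1}\left(1 - \frac{2}{m+1}\right) = \left(\frac{m}{m+1}\right)^2$. The reduction in volume is the product of the square roots
of these, giving the equality in the theorem.

Then using $(1 + x)^p \le e^{xp}$, we have

$$\left(\frac{m^2}{m^2-1}\right)^{(m-1)/2} \frac{m}{m+1} = \left(1 + \frac{1}{m^2-1}\right)^{(m-1)/2} \left(1 - \frac{1}{m+1}\right) < \exp\left(\frac{1}{2(m+1)} - \frac{1}{m+1}\right) = \exp\left(-\frac{1}{2(m+1)}\right).$$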


Convergence 


The ellipsoid method is initiated by selecting $y_0$ and $R$ such that condition (A1) is
satisfied. Then $B_0 = R^2 I$, and the corresponding $E_0$ contains $\Omega$. The updating of
the $E_k$'s is continued until a solution is found.

Under the assumptions stated above, the ellipsoid method
reduces the volume of an ellipsoid to one-half of its initial value in $O(m)$ iterations.
(See Appendix A for O notation.) Hence it can reduce the volume to less than that
of a sphere of radius $r$ in $O(m^2 \log(R/r))$ iterations, since its volume is bounded
from below by $\mathrm{vol}(S(0, 1))\, r^m$ and the initial volume is $\mathrm{vol}(S(0, 1))\, R^m$. Generally
a single iteration requires $O(m^2)$ arithmetic operations. Hence the entire process
requires $O(m^4 \log(R/r))$ arithmetic operations.
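The update (2) and the stopping test can be put together in a few lines. What follows is a minimal sketch under stated assumptions rather than a production implementation: the function name, the iteration cap, and the rule of cutting on the most violated constraint are choices made here (the text only requires some violated constraint), and the dimension must satisfy $m \ge 2$ so that $\delta$ is defined.

```python
import numpy as np

def ellipsoid_method(A, c, y0, R, max_iter=10_000):
    """Search for y with A.T @ y <= c, i.e. y^T a_j <= c_j for every column a_j of A."""
    m = y0.size                          # dimension; the formulas need m >= 2
    tau = 1.0 / (m + 1)
    delta = m**2 / (m**2 - 1.0)
    sigma = 2.0 * tau
    y = y0.astype(float)
    B = R**2 * np.eye(m)                 # E_0 is the ball S(y0, R)
    for _ in range(max_iter):
        violation = A.T @ y - c
        j = int(np.argmax(violation))    # most violated constraint (an assumed rule)
        if violation[j] <= 0:
            return y                     # the current center lies in the set
        a = A[:, j]
        Ba = B @ a
        aBa = a @ Ba
        y = y - tau * Ba / np.sqrt(aBa)
        B = delta * (B - sigma * np.outer(Ba, Ba) / aBa)
    return None                          # no feasible center found within max_iter

# Sample set: the triangle y1 >= 1, y2 >= 1, y1 + y2 <= 3, written as y^T a_j <= c_j.
A = np.array([[-1.0, 0.0, 1.0], [0.0, -1.0, 1.0]])
c = np.array([-1.0, -1.0, 3.0])
y = ellipsoid_method(A, c, y0=np.zeros(2), R=10.0)
print(y)
```

Because the triangle has positive area and is contained in $S(0, 10)$, assumptions (A1) and (A2) hold for this sample, and the volume argument above bounds the number of iterations before a center falls inside the set.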


Ellipsoid Method for Usual Form of LP 


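Now consider the linear program (where $A$ is $m \times n$)

$$
\begin{array}{lll}
\text{(P)} & \text{maximize} & c^T x \\
& \text{subject to} & Ax \le b \\
& & x \ge 0
\end{array}
$$

and its dual

$$
\begin{array}{lll}
\text{(D)} & \text{minimize} & y^T b \\
& \text{subject to} & y^T A \ge c^T \\
& & y \ge 0.
\end{array}
$$

Both problems can be solved by finding a feasible point to inequalities

$$
\begin{array}{r}
-c^T x + b^T y \le 0 \\
Ax \le b \\
-A^T y \le -c \\
x,\ y \ge 0,
\end{array}
\qquad (3)
$$

where both $x$ and $y$ are variables. Thus, the total number of arithmetic operations
for solving a linear program is bounded by $O((m+n)^4 \log(R/r))$.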


5.4 THE ANALYTIC CENTER 


The new interior-point algorithms introduced by Karmarkar move by successive 
steps inside the feasible region. It is the interior of the feasible set rather than the 
vertices and edges that plays a dominant role in this type of algorithm. In fact, these 
algorithms purposely avoid the edges of the set, only eventually converging to one 
as a solution. 

Our study of these algorithms begins in the next section, but it is useful at this 
point to introduce a concept that definitely focuses on the interior of a set, termed 
the set’s analytic center. As the name implies, the center is away from the edge. 

In addition, the study of the analytic center introduces a special structure, 
termed a barrier or potential that is fundamental to interior-point methods. 


³ Assumption (A2) is sometimes too strong. It has been shown, however, that when the data
consists of integers, it is possible to perturb the problem so that (A2) is satisfied and if the
perturbed problem has a feasible solution, so does the original $\Omega$.

Consider a set $\mathcal{S}$ in a subset $\mathcal{X}$ of $E^n$ defined by a group of inequalities as

$$\mathcal{S} = \{x \in \mathcal{X} : g_j(x) \ge 0,\ j = 1, 2, \ldots, m\},$$

and assume that the functions $g_j$ are continuous. $\mathcal{S}$ has a nonempty interior
$\mathring{\mathcal{S}} = \{x \in \mathcal{X} : g_j(x) > 0, \text{ all } j\}$. Associated with this definition of the set is the potential
function

$$\psi(x) = -\sum_{j=1}^{m} \log g_j(x)$$

defined on $\mathring{\mathcal{S}}$.

The analytic center of $\mathcal{S}$ is the vector (or set of vectors) that minimizes the
potential; that is, the vector (or vectors) that solve

$$\min \psi(x) = \min \left\{ -\sum_{j=1}^{m} \log g_j(x) : x \in \mathcal{X},\ g_j(x) > 0 \text{ for each } j \right\}.$$


Example 1. (A cube). Consider the set $\mathcal{S}$ defined by $x_i \ge 0$, $(1 - x_i) \ge 0$, for
$i = 1, 2, \ldots, n$. This is $\mathcal{S} = [0, 1]^n$, the unit cube in $E^n$. The analytic center can
be found by differentiation to be $x_i = 1/2$, for all $i$. Hence, the analytic center is
identical to what one would normally call the center of the unit cube.


In general, the analytic center depends on how the set is defined, that is, on the
particular inequalities used in the definition. For instance, the unit cube is also
defined by the inequalities $x_i \ge 0$, $(1 - x_i)^d \ge 0$ with odd $d > 1$. In this case the solution
is $x_i = 1/(d+1)$ for all $i$. For large $d$ this point is near the inner corner of the
unit cube.

Also, the addition of redundant inequalities can change the location
of the analytic center. For example, repeating a given inequality will change the
center's location.
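The dependence on the description of the set can be checked numerically in one coordinate. The sketch below, whose function name and tolerance are illustrative assumptions, minimizes $-\log x - d \log(1 - x)$ on $(0, 1)$ by bisection on its derivative. The same potential arises either from the inequality $(1 - x)^d \ge 0$ or from repeating the inequality $(1 - x) \ge 0$ a total of $d$ times, so it covers both remarks above.

```python
def analytic_center_1d(d, tol=1e-12):
    """Minimize -log(x) - d*log(1 - x) on (0, 1) by bisection on the derivative."""
    def dpsi(x):
        return -1.0 / x + d / (1.0 - x)      # increasing in x on (0, 1)

    lo, hi = 1e-12, 1.0 - 1e-12              # dpsi(lo) < 0 < dpsi(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dpsi(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for d in (1, 3, 9):
    print(d, analytic_center_1d(d), 1 / (d + 1))   # numerical center vs. 1/(d+1)
```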

There are several sets associated with linear programs for which the analytic 
center is of particular interest. One such set is the feasible region itself. Another is 
the set of optimal solutions. There are also sets associated with dual and primal-dual 
formulations. All of these are related in important ways. 

Let us illustrate by considering the analytic center associated with a bounded
polytope in $E^m$ represented by $n$ $(> m)$ linear inequalities; that is,

$$\Omega = \{y \in E^m : c^T - y^T A \ge 0\},$$

where $A \in E^{m \times n}$ and $c \in E^n$ are given and $A$ has rank $m$. Denote the interior of
$\Omega$ by

$$\mathring{\Omega} = \{y \in E^m : c^T - y^T A > 0\}.$$

The potential function for this set is

$$\psi_\Omega(y) \equiv -\sum_{j=1}^{n} \log(c_j - y^T a_j) = -\sum_{j=1}^{n} \log s_j, \qquad (4)$$

where $s \equiv c - A^T y$ is a slack vector. Hence the potential function is the negative
sum of the logarithms of the slack variables.

The analytic center of $\Omega$ is the interior point of $\Omega$ that minimizes the potential
function. This point is denoted by $y^a$ and has the associated $s^a = c - A^T y^a$. The
pair $(y^a, s^a)$ is uniquely defined, since the potential function is strictly convex (see
Section 7.4) in the bounded convex set $\mathring{\Omega}$.

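Setting to zero the derivatives of $\psi_\Omega(y)$ with respect to each $y_i$ gives

$$\sum_{j=1}^{n} \frac{a_{ij}}{c_j - y^T a_j} = 0, \quad \text{for all } i,$$

which can be written

$$\sum_{j=1}^{n} \frac{a_{ij}}{s_j} = 0, \quad \text{for all } i.$$

Now define $x_j = 1/s_j$ for each $j$. We introduce the notation

$$x \circ s \equiv (x_1 s_1, x_2 s_2, \ldots, x_n s_n)^T,$$

which is component multiplication. Then the analytic center is defined by the
conditions

$$
\begin{array}{rcl}
x \circ s &=& \mathbf{1} \\
Ax &=& 0 \\
A^T y + s &=& c.
\end{array}
$$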
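These conditions can be verified numerically on a small polytope. The sketch below minimizes the potential (4) with a damped Newton iteration; the function name, the backtracking rule, the tolerances, and the sample triangle are all assumptions made for illustration, and Newton's method itself is only treated later in the text.

```python
import numpy as np

def analytic_center(A, c, y0, tol=1e-10, max_iter=100):
    """Damped Newton iteration for min -sum(log(c - A.T @ y)) from an interior y0."""
    y = y0.astype(float)
    for _ in range(max_iter):
        s = c - A.T @ y                      # slack vector, kept positive
        x = 1.0 / s
        grad = A @ x                         # gradient of the potential is A x
        if np.linalg.norm(grad) < tol:
            break
        hess = (A * x**2) @ A.T              # A diag(1/s_j^2) A^T
        step = -np.linalg.solve(hess, grad)
        psi = -np.sum(np.log(s))
        t = 1.0
        while t > 1e-12:                     # backtrack: stay interior, do not increase psi
            s_new = c - A.T @ (y + t * step)
            if np.all(s_new > 0) and -np.sum(np.log(s_new)) <= psi:
                break
            t *= 0.5
        y = y + t * step
    s = c - A.T @ y
    return y, s, 1.0 / s

# Sample polytope: y1 >= 0, y2 >= 0, y1 + y2 <= 1, written as c - A^T y >= 0.
A = np.array([[-1.0, 0.0, 1.0], [0.0, -1.0, 1.0]])
c = np.array([0.0, 0.0, 1.0])
y, s, x = analytic_center(A, c, np.array([0.1, 0.2]))
print(y, x * s, A @ x)
```

For this triangle the center maximizes the product $y_1 y_2 (1 - y_1 - y_2)$, so the iteration should approach $(1/3, 1/3)$, with $x \circ s = \mathbf{1}$ by construction and $Ax$ close to zero.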


The analytic center can be defined when the interior is empty or equalities are
present, such as

$$\Omega = \{y \in E^m : c^T - y^T A \ge 0,\ By = b\}.$$

In this case the analytic center is chosen on the linear surface $\{y : By = b\}$ to
maximize the product of the slack variables $s = c - A^T y$. Thus, in this context
the interior of $\Omega$ refers to the interior of the positive orthant of slack variables:
$R^n_+ \equiv \{s : s \ge 0\}$. This definition of interior depends only on the region of the slack
variables. Even if there is only a single point in $\Omega$ with $s = c - A^T y$ for some $y$
where $By = b$ with $s > 0$, we still say that $\mathring{\Omega}$ is not empty.


5.5 THE CENTRAL PATH


The concept underlying interior-point methods for linear programming is to use 
nonlinear programming techniques of analysis and methodology. The analysis is 
often based on differentiation of the functions defining the problem. Traditional 
linear programming does not require these techniques since the defining functions 
are linear. Duality in general nonlinear programs is typically manifested through 
Lagrange multipliers (which are called dual variables in linear programming). The 
analysis and algorithms of the remaining sections of the chapter use these nonlinear 
techniques. These techniques are discussed systematically in later chapters, so rather 
than treat them in detail at this point, these current sections provide only minimal 
detail in their application to linear programming. It is expected that most readers 
are already familiar with the basic method for minimizing a function by setting 
its derivative to zero, and for incorporating constraints by introducing Lagrange 
multipliers. These methods are discussed in detail in Chapters 11-15. 

The computational algorithms of nonlinear programming are typically iterative 
in nature, often characterized as search algorithms. At any step with a given point, 
a direction for search is established and then a move in that direction is made to 
define the next point. There are many varieties of such search algorithms and they 
are systematically presented throughout the text. In this chapter, we use versions of 
Newton’s method as the search algorithm, but we postpone a detailed study of the 
method until later chapters. 

Not only have nonlinear methods improved linear programming, but interior- 
point methods for linear programming have been extended to provide new 
approaches to nonlinear programming. This chapter is intended to show how 
this merger of linear and nonlinear programming produces elegant and effective 
methods. These ideas take an especially pleasing form when applied to linear 
programming. Study of them here, even without all the detailed analysis, should 
provide good intuitive background for the more general manifestations. 

Consider a primal linear program in standard form

$$
\begin{array}{lll}
\text{(LP)} & \text{minimize} & c^T x \\
& \text{subject to} & Ax = b \\
& & x \ge 0.
\end{array}
\qquad (5)
$$

We denote the feasible region of this program by $\mathcal{F}_p$. We assume that
$\mathring{\mathcal{F}}_p = \{x : Ax = b,\ x > 0\}$ is nonempty and the optimal solution set of the problem is bounded.

Associated with this problem, we define for $\mu \ge 0$ the barrier problem

$$
\begin{array}{lll}
\text{(BP)} & \text{minimize} & c^T x - \mu \sum_{j=1}^{n} \log x_j \\
& \text{subject to} & Ax = b \\
& & x > 0.
\end{array}
\qquad (6)
$$

It is clear that $\mu = 0$ corresponds to the original problem (5). As $\mu \to \infty$, the
solution approaches the analytic center of the feasible region (when it is bounded),
since the barrier term swamps out $c^T x$ in the objective. As $\mu$ is varied continuously
toward 0, there is a path $x(\mu)$ defined by the solution to (BP). This path $x(\mu)$ is
termed the primal central path. As $\mu \to 0$ this path converges to the analytic center
of the optimal face $\{x : c^T x = z^*,\ Ax = b,\ x \ge 0\}$, where $z^*$ is the optimal value
of (LP).

A strategy for solving (LP) is to solve (BP) for smaller and smaller values
of $\mu$ and thereby approach a solution to (LP). This is indeed the basic idea of
interior-point methods.

At any $\mu > 0$, under the assumptions that we have made for problem (5), the
necessary and sufficient conditions for a unique and bounded solution are obtained
by introducing a Lagrange multiplier vector $y$ for the linear equality constraints to
form the Lagrangian (see Chapter 11)

$$c^T x - \mu \sum_{j=1}^{n} \log x_j - y^T (Ax - b).$$

The derivatives with respect to the $x_j$'s are set to zero, leading to the conditions

$$c_j - \mu / x_j - y^T a_j = 0, \quad \text{for each } j$$

or equivalently

$$\mu X^{-1} \mathbf{1} + A^T y = c \qquad (7)$$

where as before $a_j$ is the $j$-th column of $A$, $\mathbf{1}$ is the vector of 1's, and $X$ is
the diagonal matrix whose diagonal entries are the components of $x > 0$. Setting
$s_j = \mu / x_j$ the complete set of conditions can be rewritten

$$
\begin{array}{rcl}
x \circ s &=& \mu \mathbf{1} \\
Ax &=& b \\
A^T y + s &=& c.
\end{array}
\qquad (8)
$$

Note that $y$ is a dual feasible solution and $c - A^T y > 0$ (see Exercise 4).


Example 2. (A square primal). Consider the problem of maximizing $x_1$ within
the unit square $\mathcal{S} = [0, 1]^2$. The problem is formulated as

$$
\begin{array}{ll}
\min & -x_1 \\
\text{subject to} & x_1 + x_3 = 1 \\
& x_2 + x_4 = 1 \\
& x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0,\ x_4 \ge 0.
\end{array}
$$

Here $x_3$ and $x_4$ are slack variables for the original problem to put it in standard
form. The optimality conditions for $x(\mu)$ consist of the original 2 linear constraint
equations and the four equations

$$
\begin{array}{rcl}
y_1 + s_1 &=& -1 \\
y_2 + s_2 &=& 0 \\
y_1 + s_3 &=& 0 \\
y_2 + s_4 &=& 0
\end{array}
$$

together with the relations $s_i = \mu / x_i$ for $i = 1, 2, \ldots, 4$. These equations are readily
solved with a series of elementary variable eliminations to find

$$x_1(\mu) = \frac{1 - 2\mu \pm \sqrt{1 + 4\mu^2}}{2}, \qquad x_2(\mu) = 1/2.$$

Using the "+" solution, it is seen that as $\mu \to 0$ the solution goes to $x \to (1, 1/2)$.
Note that this solution is not a corner of the square. Instead it is at the analytic center
of the optimal face $\{x : x_1 = 1,\ 0 \le x_2 \le 1\}$. See Fig. 5.2. The limit of $x(\mu)$ as
$\mu \to \infty$ can be seen to be the point $(1/2, 1/2)$. Hence, the central path in this case
is a straight line progressing from the analytic center of the square (at $\mu \to \infty$) to
the analytic center of the optimal face (at $\mu \to 0$).
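The closed form for this example is easy to check numerically. In the sketch below the function names are illustrative; the residual is the condition obtained by eliminating $y_1$ from $y_1 + s_1 = -1$ and $y_1 + s_3 = 0$, namely $-1 - \mu/x_1 + \mu/x_3 = 0$ with $x_3 = 1 - x_1$.

```python
import math

def primal_central_path(mu):
    """Point x(mu) = (x1, x2, x3, x4) on the central path of Example 2, for mu > 0."""
    x1 = (1.0 - 2.0 * mu + math.sqrt(1.0 + 4.0 * mu * mu)) / 2.0   # the "+" solution
    x2 = 0.5
    return x1, x2, 1.0 - x1, 1.0 - x2          # x3 and x4 are the slacks

def stationarity_residual(mu):
    """Residual of -1 - mu/x1 + mu/x3 = 0, the reduced optimality condition in x1."""
    x1, _, x3, _ = primal_central_path(mu)
    return -1.0 - mu / x1 + mu / x3

for mu in (100.0, 1.0, 0.01, 1e-6):
    print(mu, primal_central_path(mu)[:2], stationarity_residual(mu))
```

For large $\mu$ the printed points approach $(1/2, 1/2)$ and for small $\mu$ they approach $(1, 1/2)$, with a residual that is zero up to rounding.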


Dual Central Path 


Now consider the dual problem

$$
\begin{array}{lll}
\text{(LD)} & \text{maximize} & y^T b \\
& \text{subject to} & y^T A + s^T = c^T \\
& & s \ge 0.
\end{array}
$$

Fig. 5.2 The analytic path for the square (axes $x_1$, $x_2$)

We may apply the barrier approach to this problem by formulating the problem

$$
\begin{array}{lll}
\text{(BD)} & \text{maximize} & y^T b + \mu \sum_{j=1}^{n} \log s_j \\
& \text{subject to} & y^T A + s^T = c^T \\
& & s > 0.
\end{array}
$$

We assume that the dual feasible set $\mathcal{F}_d$ has a nonempty interior
$\mathring{\mathcal{F}}_d = \{(y, s) : y^T A + s^T = c^T,\ s > 0\}$ and that the optimal solution set of (LD) is bounded. Then, as $\mu$
is varied continuously toward 0, there is a path $(y(\mu), s(\mu))$ defined by the solution
to (BD). This path is termed the dual central path.


To work out the necessary and sufficient conditions we introduce $x$ as a
Lagrange multiplier and form the Lagrangian

$$y^T b + \mu \sum_{j=1}^{n} \log s_j - (y^T A + s^T - c^T) x.$$

Setting to zero the derivative with respect to $y_i$ leads to

$$b_i - a^i x = 0, \quad \text{for all } i$$

where $a^i$ is the $i$-th row of $A$. Setting to zero the derivative with respect to $s_j$ leads
to

$$\mu / s_j - x_j = 0, \quad \text{for all } j.$$

Combining these equations and including the original constraint yields the complete
set of conditions

$$
\begin{array}{rcl}
x \circ s &=& \mu \mathbf{1} \\
Ax &=& b \\
A^T y + s &=& c.
\end{array}
$$

These are identical to the optimality conditions for the primal central path (8). Note
that $x$ is a primal feasible solution and $x > 0$.

To see the geometric representation of the dual central path, consider the dual
level set

$$\Omega(z) = \{y : c^T - y^T A \ge 0,\ y^T b \ge z\}$$

for any $z < z^*$ where $z^*$ is the optimal value of (LD). Then, the analytic center
$(y(z), s(z))$ of $\Omega(z)$ coincides with the dual central path as $z$ tends to the optimal
value $z^*$ from below. This is illustrated in Fig. 5.3, where the feasible region of
the dual set (not the primal) is shown. The level sets $\Omega(z)$ are shown for various
values of $z$. The analytic centers of these level sets correspond to the dual central
path.

Fig. 5.3 The central path as analytic centers in the dual feasible region (figure label: the objective hyperplanes)


Example 3. (The square dual). Consider the dual of Example 2. This is

$$
\begin{array}{ll}
\max & y_1 + y_2 \\
\text{subject to} & y_1 \le -1 \\
& y_2 \le 0.
\end{array}
$$

(The values of $s_1$ and $s_2$ are the slack variables of the inequalities.) The solution
to the dual barrier problem is easily found from the solution of the primal barrier
problem to be

$$y_1(\mu) = -1 - \mu / x_1(\mu), \qquad y_2(\mu) = -2\mu.$$

As $\mu \to 0$, we have $y_1 \to -1$, $y_2 \to 0$, which is the unique solution to the dual LP.
However, as $\mu \to \infty$, the vector $y$ is unbounded, for in this case the dual feasible
set is itself unbounded.


Primal—Dual Central Path 


Suppose the feasible region of the primal (LP) has interior points and its optimal
solution set is bounded. Then, the dual also has interior points (see Exercise 4). The
primal-dual path is defined to be the set of vectors $(x(\mu), y(\mu), s(\mu))$ that satisfy
the conditions

$$
\begin{array}{rcl}
x \circ s &=& \mu \mathbf{1} \\
Ax &=& b \\
A^T y + s &=& c \\
x \ge 0, && s \ge 0
\end{array}
\qquad (9)
$$

for $0 \le \mu < \infty$. Hence the central path is defined without explicit reference to
an optimization problem. It is simply defined in terms of the set of equality and
inequality conditions.

Since conditions (8) and (9) are identical, the primal-dual central path can be
split into two components by projecting onto the relevant space, as described in the
following proposition.


Proposition 1. Suppose the feasible sets of the primal and dual programs
contain interior points. Then the primal-dual central path $(x(\mu), y(\mu), s(\mu))$
exists for all $\mu$, $0 \le \mu < \infty$. Furthermore, $x(\mu)$ is the primal central path,
and $(y(\mu), s(\mu))$ is the dual central path. Moreover, $x(\mu)$ and $(y(\mu), s(\mu))$
converge to the analytic centers of the optimal primal solution and dual solution
faces, respectively, as $\mu \to 0$.


Duality Gap 


Let $(x(\mu), y(\mu), s(\mu))$ be on the primal-dual central path. Then from (9) it follows
that

$$c^T x - y^T b = y^T A x + s^T x - y^T b = s^T x = n\mu.$$

The value $c^T x - y^T b = s^T x$ is the difference between the primal objective value and
the dual objective value. This value is always nonnegative (see the weak duality
lemma in Section 4.2) and is termed the duality gap.

The duality gap provides a measure of closeness to optimality. For any primal
feasible $x$, the value $c^T x$ gives an upper bound as $c^T x \ge z^*$ where $z^*$ is the optimal
value of the primal. Likewise, for any dual feasible pair $(y, s)$, the value $y^T b$ gives
a lower bound as $y^T b \le z^*$. The difference, the duality gap $g = c^T x - y^T b$, provides
a bound on $z^*$ as $z^* \ge c^T x - g$. Hence if at a feasible point $x$, a dual feasible $(y, s)$
is available, the quality of $x$ can be measured as $c^T x - z^* \le g$.

At any point on the primal-dual central path, the duality gap is equal to $n\mu$.
It is clear that as $\mu \to 0$ the duality gap goes to zero, and hence both $x(\mu)$ and
$(y(\mu), s(\mu))$ approach optimality for the primal and dual, respectively.
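The identity "duality gap $= n\mu$" can be checked on the square of Examples 2 and 3, where $n = 4$. The sketch below reuses the closed-form path from those examples; the function names are illustrative only.

```python
import math

def central_path_point(mu):
    """(x, y, s) on the primal-dual central path of Examples 2 and 3, for mu > 0."""
    x1 = (1.0 - 2.0 * mu + math.sqrt(1.0 + 4.0 * mu * mu)) / 2.0
    x = (x1, 0.5, 1.0 - x1, 0.5)
    s = tuple(mu / xj for xj in x)           # s_j = mu / x_j
    y = (-1.0 - mu / x1, -2.0 * mu)          # dual path from Example 3
    return x, y, s

def duality_gap(mu):
    """Return (c^T x - b^T y, s^T x) at the central path point for mu."""
    x, y, s = central_path_point(mu)
    c, b = (-1.0, 0.0, 0.0, 0.0), (1.0, 1.0)
    primal = sum(cj * xj for cj, xj in zip(c, x))
    dual = sum(bi * yi for bi, yi in zip(b, y))
    return primal - dual, sum(xj * sj for xj, sj in zip(x, s))

for mu in (10.0, 1.0, 0.1):
    print(mu, duality_gap(mu), 4 * mu)       # both entries should equal n * mu = 4 * mu
```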


5.6 SOLUTION STRATEGIES 


The various definitions of the central path directly suggest corresponding strategies
for solution of a linear program. We outline three general approaches here: the
primal barrier or path-following method, the primal-dual path-following method
and the primal-dual potential-reduction method, although the details of their
implementation and analysis must be deferred to later chapters after study of general
nonlinear methods. Table 5.1 depicts these solution strategies and the simplex
methods described in Chapters 3 and 4 with respect to how they meet the three
optimality conditions: Primal Feasibility, Dual Feasibility, and Zero-Duality during
the iterative process.

Table 5.1 Properties of algorithms

                                    P-F    D-F    0-Duality
Primal Simplex                       √             √
Dual Simplex                                √      √
Primal Barrier                       √
Primal-Dual Path-Following           √      √
Primal-Dual Potential-Reduction      √      √

For example, the primal simplex method keeps improving a primal feasible
solution, maintains the zero-duality gap (complementary slackness condition)
and moves toward dual feasibility; while the dual simplex method keeps
improving a dual feasible solution, maintains the zero-duality gap (complementarity
condition) and moves toward primal feasibility (see Section 4.3). The
primal barrier method keeps improving a primal feasible solution and moves
toward dual feasibility and complementarity; and the primal-dual interior-point
methods keep improving a primal and dual feasible solution pair and move toward
complementarity.


Primal Barrier Method 


A direct approach is to use the barrier construction and solve the problem

$$
\begin{array}{ll}
\text{minimize} & c^T x - \mu \sum_{j=1}^{n} \log x_j \\
\text{subject to} & Ax = b \\
& x \ge 0,
\end{array}
\qquad (10)
$$

for a very small value of $\mu$. In fact, if we desire to reduce the duality gap to $\varepsilon$ it is
only necessary to solve the problem for $\mu = \varepsilon / n$. Unfortunately, when $\mu$ is small,
the problem (10) could be highly ill-conditioned in the sense that the necessary
conditions are nearly singular. This makes it difficult to directly solve the problem
for small $\mu$.

An overall strategy, therefore, is to start with a moderately large $\mu$ (say $\mu =
100$) and solve that problem approximately. The corresponding solution is a point
approximately on the primal central path, but it is likely to be quite distant from the
point corresponding to the limit of $\mu \to 0$. However this solution point at $\mu = 100$
can be used as the starting point for the problem with a slightly smaller $\mu$, for this
point is likely to be close to the solution of the new problem. The value of $\mu$ might
be reduced at each stage by a specific factor, giving $\mu_{k+1} = \gamma \mu_k$, where $\gamma$ is a fixed
positive parameter less than one and $k$ is the stage count.

If the strategy is begun with a value $\mu_0$, then at the $k$-th stage we have
$\mu_k = \gamma^k \mu_0$. Hence to reduce $\mu_k / \mu_0$ to below $\varepsilon$ requires

$$k = \frac{\log \varepsilon}{\log \gamma}$$

stages.
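As a quick illustration of the stage count, the sketch below rounds $\log \varepsilon / \log \gamma$ up to an integer and prints the resulting schedule endpoint. The function name and the sample values $\gamma = 0.5$, $\varepsilon = 10^{-6}$ are assumptions for this example; floating-point rounding could shift the result by one when the ratio is an exact integer.

```python
import math

def stages_needed(eps, gamma):
    """Smallest integer k with gamma**k <= eps, for 0 < gamma < 1 and 0 < eps < 1."""
    return math.ceil(math.log(eps) / math.log(gamma))

mu0, gamma, eps = 100.0, 0.5, 1e-6
k = stages_needed(eps, gamma)
print(k, mu0 * gamma**k)      # number of stages and the final barrier parameter
```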

Often a version of Newton's method for minimization is used to solve each of
the problems. For the current strategy, Newton's method works on problem (10)
with fixed $\mu$ by considering the central path equations (8)

$$
\begin{array}{rcl}
x \circ s &=& \mu \mathbf{1} \\
Ax &=& b \\
A^T y + s &=& c.
\end{array}
\qquad (11)
$$

From a given point $x \in \mathring{\mathcal{F}}_p$, Newton's method moves to a closer point $x^+ \in \mathring{\mathcal{F}}_p$
by moving in the directions $d_x$, $d_y$ and $d_s$ determined from the linearized version
of (11)

$$
\begin{array}{rcl}
\mu X^{-2} d_x + d_s &=& \mu X^{-1} \mathbf{1} - c, \\
A d_x &=& 0, \\
-A^T d_y - d_s &=& 0.
\end{array}
\qquad (12)
$$

(Recall that $X$ is the diagonal matrix whose diagonal entries are components of
$x > 0$.) The new point is then updated by taking a step in the direction of $d_x$, as
$x^+ = x + d_x$.

Notice that if $x \circ s = \mu \mathbf{1}$ for some $s = c - A^T y$, then $d \equiv (d_x, d_y, d_s) = 0$ because
the current point satisfies $Ax = b$ and hence is already the central path solution for
$\mu$. If some component of $x \circ s$ is less than $\mu$, then $d$ will tend to increment the
solution so as to increase that component. The converse will occur for components
of $x \circ s$ greater than $\mu$.

This process may be repeated several times until a point close enough to the 
proper solution to the barrier problem for the given value of $\mu$ is obtained. That is,
until the necessary and sufficient conditions (7) are (approximately) satisfied. 

There are several details involved in a complete implementation and analysis of 
Newton’s method. These items are discussed in later chapters of the text. However, 
the method works well if either $\mu$ is moderately large, or if the algorithm is initiated
at a point very close to the solution, exactly as needed for the barrier strategy 
discussed in this subsection. 

To solve (12), premultiplying both sides by $X^2$ we have

$$\mu d_x + X^2 d_s = \mu X \mathbf{1} - X^2 c.$$

Then, premultiplying by $A$ and using $A d_x = 0$, we have

$$A X^2 d_s = \mu A X \mathbf{1} - A X^2 c.$$

Using $d_s = -A^T d_y$ we have

$$(A X^2 A^T) d_y = -\mu A X \mathbf{1} + A X^2 c.$$

Thus, $d_y$ can be computed by solving the above linear system of equations. Then $d_s$
can be found from the third equation in (12) and finally $d_x$ can be found from the
first equation in (12). Together this amounts to $O(nm^2 + m^3)$ arithmetic operations
for each Newton step.
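The elimination just described translates directly into code. The following is a sketch under stated assumptions: the function names, the fixed number of Newton steps, and the damping factor that keeps the iterate strictly positive are additions made here and are not prescribed by the text. The data are those of Example 2, so the result can be compared with the closed form $x_1(\mu)$ found there.

```python
import numpy as np

def barrier_newton_direction(A, c, x, mu):
    """Directions (d_x, d_y, d_s) from (12), computed by the elimination in the text."""
    X2 = x**2
    M = (A * X2) @ A.T                           # A X^2 A^T
    rhs = A @ (X2 * c) - mu * (A @ x)            # A X^2 c - mu A X 1
    d_y = np.linalg.solve(M, rhs)
    d_s = -A.T @ d_y                             # third equation of (12)
    d_x = (mu * x - X2 * c - X2 * d_s) / mu      # first equation, premultiplied by X^2
    return d_x, d_y, d_s

def solve_barrier_problem(A, c, x, mu, iters=30):
    """Repeat Newton steps for fixed mu from a strictly feasible x (A x = b, x > 0)."""
    for _ in range(iters):
        d_x, _, _ = barrier_newton_direction(A, c, x, mu)
        neg = d_x < 0
        # Assumed safeguard: shorten the step if a full step would leave x > 0.
        alpha = min(1.0, 0.9 * np.min(-x[neg] / d_x[neg])) if neg.any() else 1.0
        x = x + alpha * d_x
    return x

A = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])
b = np.array([1.0, 1.0])
c = np.array([-1.0, 0.0, 0.0, 0.0])
mu = 1.0
x = solve_barrier_problem(A, c, np.full(4, 0.5), mu)
print(x, (1 - 2 * mu + np.sqrt(1 + 4 * mu**2)) / 2)   # x[0] vs. the closed form
```

Since $A d_x = 0$ at every step, the iterates remain on $Ax = b$ and only the positivity of $x$ needs to be guarded.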


*Primal-Dual Path-Following 


Another strategy for solving a linear program is to follow the central path from a 
given initial primal-dual solution pair. Consider a linear program in standard form 


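$$
\begin{array}{lll}
\text{(LP)} & \text{minimize} & c^T x \\
& \text{subject to} & Ax = b \\
& & x \ge 0.
\end{array}
$$

$$
\begin{array}{lll}
\text{(LD)} & \text{maximize} & y^T b \\
& \text{subject to} & y^T A + s^T = c^T \\
& & s \ge 0.
\end{array}
$$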
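Assume that $\mathring{\mathcal{F}} \ne \emptyset$; that is, both⁴

$$\mathring{\mathcal{F}}_p = \{x : Ax = b,\ x > 0\} \ne \emptyset$$

and

$$\mathring{\mathcal{F}}_d = \{(y, s) : s = c - A^T y > 0\} \ne \emptyset,$$

and denote by $z^*$ the optimal objective value.

The central path can be expressed as

$$\mathcal{C} = \left\{ (x, y, s) \in \mathring{\mathcal{F}} : x \circ s = \frac{x^T s}{n} \mathbf{1} \right\}$$

in the primal-dual form. On the path we have $x \circ s = \mu \mathbf{1}$ and hence $s^T x = n\mu$. A
neighborhood of the central path $\mathcal{C}$ is of the form

$$\mathcal{N}(\eta) = \{(x, y, s) \in \mathring{\mathcal{F}} : |s \circ x - \mu \mathbf{1}| < \eta\mu, \text{ where } \mu = s^T x / n\} \qquad (13)$$

for some $\eta \in (0, 1)$, say $\eta = 1/4$. This can be thought of as a tube whose center is
the central path.

⁴ The symbol $\emptyset$ denotes the empty set.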

The idea of the path-following method is to move within a tubular neighborhood
of the central path toward the solution point. A suitable initial point $(x^0, y^0, s^0) \in
\mathcal{N}(\eta)$ can be found by solving the barrier problem for some fixed $\mu_0$ or from
an initialization phase proposed later. After that, step by step moves are made,
alternating between a predictor step and a corrector step. After each pair of steps,
the point achieved is again in the fixed given neighborhood of the central path, but
closer to the linear program's solution set.

The predictor step is designed to move essentially parallel to the true central
path. The step $d \equiv (d_x, d_y, d_s)$ is determined from the linearized version of the
primal-dual central path equations of (9), as

$$
\begin{array}{rcl}
s \circ d_x + x \circ d_s &=& \gamma\mu \mathbf{1} - x \circ s, \\
A d_x &=& 0, \\
-A^T d_y - d_s &=& 0,
\end{array}
\qquad (14)
$$

where here one selects $\gamma = 0$. (To show the dependence of $d$ on the current pair
$(x, s)$ and the parameter $\gamma$, we write $d = d(x, s, \gamma)$.)

The new point is then found by taking a step in the direction of $d$, as
$(x^+, y^+, s^+) = (x, y, s) + \alpha (d_x, d_y, d_s)$, where $\alpha$ is the step-size. Note that $d_x^T d_s =
-d_x^T A^T d_y = 0$ here. Then

$$(x^+)^T s^+ = (x + \alpha d_x)^T (s + \alpha d_s) = x^T s + \alpha (d_x^T s + x^T d_s) = (1 - \alpha) x^T s,$$

where the last step follows by multiplying the first equation in (14) by $\mathbf{1}^T$. Thus,
the predictor step reduces the duality gap by a factor $1 - \alpha$. The step-size $\alpha$ is
chosen as large as possible in that parallel direction without going outside
of the neighborhood $\mathcal{N}(2\eta)$.

The corrector step essentially moves perpendicular to the central path in order
to get closer to it. This step moves the solution back to within the neighborhood
$\mathcal{N}(\eta)$, and the step is determined by selecting $\gamma = 1$ in (14) with $\mu = x^T s / n$. Notice
that if $x \circ s = \mu \mathbf{1}$, then $d = 0$ because the current point is already a central path
solution.

This corrector step is identical to one step of the barrier method. Note, however,
that the predictor-corrector method requires only one sequence of steps, each
consisting of a single predictor and corrector. This contrasts with the barrier method
which requires a complete sequence for each $\mu$ to get back to the central path, and
then an outer sequence to reduce the $\mu$'s.

One can prove that for any $(x, y, s) \in \mathcal{N}(\eta)$ with $\mu = x^T s / n$, the step-size in
the predictor step satisfies

$$\alpha \ge \frac{1}{2\sqrt{n}}.$$

Thus, the iteration complexity of the method is $O(\sqrt{n} \log(1/\varepsilon))$ to achieve $\mu / \mu_0 \le
\varepsilon$ where $n\mu_0$ is the initial duality gap. Moreover, one can prove that the step-size
$\alpha \to 1$ as $x^T s \to 0$, that is, the duality reduction speed is accelerated as the gap
becomes smaller.
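The predictor-corrector scheme can be sketched in a few lines on the square of Examples 2 and 3. Everything named below is an assumption made for illustration: the function names, the backtracking factor 0.9 used to find a predictor step that stays in $\mathcal{N}(2\eta)$, the tolerance, and the use of the Euclidean norm in the neighborhood test. The Newton direction is obtained from (14) through the normal equations $(A S^{-1} X A^T) d_y = b - \gamma\mu A S^{-1}\mathbf{1}$, using $Ax = b$.

```python
import numpy as np

def newton_direction(A, x, s, gamma):
    """Solve (14) for (d_x, d_y, d_s) at a point with A x = b."""
    mu = x @ s / x.size
    D = x / s                                   # diagonal of S^-1 X
    r = gamma * mu / s - x                      # S^-1 (gamma mu 1 - x o s)
    d_y = np.linalg.solve((A * D) @ A.T, -A @ r)
    d_s = -A.T @ d_y
    d_x = r - D * d_s
    return d_x, d_y, d_s

def in_neighborhood(x, s, eta):
    """Test x > 0, s > 0 and |x o s - mu 1| <= eta mu with mu = x^T s / n."""
    if np.any(x <= 0) or np.any(s <= 0):
        return False
    mu = x @ s / x.size
    return np.linalg.norm(x * s - mu) <= eta * mu

def predictor_corrector(A, x, y, s, eta=0.25, tol=1e-8, max_iter=200):
    for _ in range(max_iter):
        if x @ s <= tol:
            break
        d_x, d_y, d_s = newton_direction(A, x, s, 0.0)        # predictor, gamma = 0
        alpha = 1.0
        while alpha > 1e-12 and not in_neighborhood(x + alpha * d_x, s + alpha * d_s, 2 * eta):
            alpha *= 0.9                                      # assumed backtracking rule
        x, y, s = x + alpha * d_x, y + alpha * d_y, s + alpha * d_s
        d_x, d_y, d_s = newton_direction(A, x, s, 1.0)        # corrector, gamma = 1
        x, y, s = x + d_x, y + d_y, s + d_s
    return x, y, s

A = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])
b = np.array([1.0, 1.0])
mu0 = 1.0
x1 = (1 - 2 * mu0 + np.sqrt(1 + 4 * mu0**2)) / 2              # central path point, Example 2
x0 = np.array([x1, 0.5, 1 - x1, 0.5])
s0 = mu0 / x0
y0 = np.array([-s0[2], -s0[3]])                               # from y1 + s3 = 0, y2 + s4 = 0
x, y, s = predictor_corrector(A, x0, y0, s0)
print(x, x @ s)
```

The starting point is taken exactly on the central path at $\mu = 1$, so it lies in $\mathcal{N}(\eta)$; the iterates should approach $x = (1, 1/2, 0, 1/2)$, the analytic center of the optimal face.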


Primal-Dual Potential Function 


In this method a primal-dual potential function is used to measure the solution's
progress. The potential is reduced at each iteration. There is no restriction on either
neighborhood or step-size during the iterative process as long as the potential is
reduced. The greater the reduction of the potential function, the faster the
convergence of the algorithm. Thus, from a practical point of view, potential-reduction
algorithms may have an advantage over path-following algorithms where iterates
are confined to lie in certain neighborhoods of the central path.


For $x \in \mathring{\mathcal{F}}_p$ and $(y, s) \in \mathring{\mathcal{F}}_d$ the primal-dual potential function is defined by

$$\psi_{n+\rho}(x, s) \equiv (n + \rho) \log(x^T s) - \sum_{j=1}^{n} \log(x_j s_j), \qquad (15)$$

where $\rho \ge 0$.

From the arithmetic and geometric mean inequality (also see Exercise 10) we
can derive that

$$n \log(x^T s) - \sum_{j=1}^{n} \log(x_j s_j) \ge n \log n.$$

Then

$$\psi_{n+\rho}(x, s) = \rho \log(x^T s) + n \log(x^T s) - \sum_{j=1}^{n} \log(x_j s_j) \ge \rho \log(x^T s) + n \log n. \qquad (16)$$

Thus, for $\rho > 0$, $\psi_{n+\rho}(x, s) \to -\infty$ implies that $x^T s \to 0$. More precisely, we have
from (16)

$$x^T s \le \exp\left( \frac{\psi_{n+\rho}(x, s) - n \log n}{\rho} \right).$$

Hence the primal-dual potential function gives an explicit bound on the magnitude
of the duality gap.
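The bound can be illustrated numerically. The sketch below evaluates the potential (15) and the resulting bound on $x^T s$ for a few positive vectors; the function names, the random test vectors, and the seed are assumptions for this illustration, and $\rho > 0$ is required. Equality holds when all products $x_j s_j$ are equal, which is the case on the central path.

```python
import math
import random

def potential(x, s, rho):
    """psi_{n+rho}(x, s) = (n + rho) log(x^T s) - sum_j log(x_j s_j)."""
    n = len(x)
    gap = sum(xj * sj for xj, sj in zip(x, s))
    return (n + rho) * math.log(gap) - sum(math.log(xj * sj) for xj, sj in zip(x, s))

def gap_bound(x, s, rho):
    """Upper bound exp((psi - n log n) / rho) on the duality gap x^T s, for rho > 0."""
    n = len(x)
    return math.exp((potential(x, s, rho) - n * math.log(n)) / rho)

random.seed(0)
n = 5
rho = math.sqrt(n)
samples = []
for _ in range(3):
    x = [random.uniform(0.1, 2.0) for _ in range(n)]
    s = [random.uniform(0.1, 2.0) for _ in range(n)]
    gap = sum(xj * sj for xj, sj in zip(x, s))
    samples.append((gap, gap_bound(x, s, rho)))
    print(gap, gap_bound(x, s, rho))        # the gap never exceeds the bound
```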

The objective of this method is to drive the potential function down toward
minus infinity. The method of reduction is a version of Newton's method (14).
In this case we select $\gamma = n/(n + \rho)$ in (14). Notice that this is a combination of
a predictor and corrector choice. The predictor uses $\gamma = 0$ and the corrector uses
$\gamma = 1$. The primal-dual potential method uses something in between. This seems
logical, for the predictor moves parallel to the central path toward a lower duality
gap, and the corrector moves perpendicular to get close to the central path. This new
method does both at once. Of course, this intuitive notion must be made precise.

For $\rho \ge \sqrt{n}$, there is in fact a guaranteed decrease in the potential function by
a fixed amount $\delta$ (see Exercises 12 and 13). Specifically,

$$\psi_{n+\rho}(x^+, s^+) - \psi_{n+\rho}(x, s) \le -\delta \qquad (17)$$

for a constant $\delta \ge 0.2$. This result provides a theoretical bound on the number
of required iterations and the bound is competitive with other methods. However,
a faster algorithm may be achieved by conducting a line search along direction
$d$ to achieve the greatest reduction in the primal-dual potential function at each
iteration.

We outline the algorithm here: 


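Step 1. Start at a point $(x_0, y_0, s_0) \in \mathring{\mathcal{F}}$ with $\psi_{n+\rho}(x_0, s_0) \le \rho \log((s_0)^T x_0) +
n \log n + O(\sqrt{n} \log n)$ which is determined by an initiation procedure, as discussed
in Section 5.7. Set $\rho \ge \sqrt{n}$. Set $k = 0$ and $\gamma = n/(n + \rho)$. Select an accuracy
parameter $\varepsilon > 0$.

Step 2. Set $(x, s) = (x_k, s_k)$ and compute $(d_x, d_y, d_s)$ from (14).

Step 3. Let $x_{k+1} = x_k + \bar{\alpha} d_x$, $y_{k+1} = y_k + \bar{\alpha} d_y$, and $s_{k+1} = s_k + \bar{\alpha} d_s$ where

$$\bar{\alpha} = \arg\min_{\alpha \ge 0} \psi_{n+\rho}(x_k + \alpha d_x,\ s_k + \alpha d_s).$$

Step 4. Let $k = k + 1$. If $\dfrac{s_k^T x_k}{s_0^T x_0} \le \varepsilon$, stop. Otherwise return to Step 2.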


Theorem 2. The algorithm above terminates in at most $O(\rho \log(n/\varepsilon))$
iterations with

$$\frac{(s_k)^T x_k}{(s_0)^T x_0} \le \varepsilon.$$

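Proof. Note that after $k$ iterations, we have from (17)

$$\psi_{n+\rho}(x_k, s_k) \le \psi_{n+\rho}(x_0, s_0) - k \cdot \delta \le \rho \log((s_0)^T x_0) + n \log n + O(\sqrt{n} \log n) - k \cdot \delta.$$

Thus, from the inequality (16),

$$\rho \log(s_k^T x_k) + n \log n \le \rho \log(s_0^T x_0) + n \log n + O(\sqrt{n} \log n) - k \cdot \delta,$$

or

$$\rho \left( \log(s_k^T x_k) - \log(s_0^T x_0) \right) \le -k \cdot \delta + O(\sqrt{n} \log n).$$

Therefore, as soon as $k \ge O(\rho \log(n/\varepsilon))$, we must have

$$\rho \left( \log(s_k^T x_k) - \log(s_0^T x_0) \right) \le -\rho \log(1/\varepsilon),$$

or

$$\frac{s_k^T x_k}{s_0^T x_0} \le \varepsilon.$$

Theorem 2 holds for any $\rho \ge \sqrt{n}$. Thus, by choosing $\rho = \sqrt{n}$, the iteration
complexity bound becomes $O(\sqrt{n} \log(n/\varepsilon))$.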
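Steps 1 to 4 can be sketched as follows on the square of Examples 2 and 3. This is an illustration under several assumptions, not the analyzed algorithm: the exact line search of Step 3 is replaced by a crude search over a fixed grid of step sizes, the starting point is simply taken on the central path at $\mu = 1$ instead of coming from the initiation procedure of Section 5.7, and the function names, grid, and iteration cap are choices made here. With the grid search, the decrease (17) is not guaranteed at every step.

```python
import numpy as np

def potential(x, s, rho):
    """psi_{n+rho}(x, s) = (n + rho) log(x^T s) - sum_j log(x_j s_j)."""
    return (x.size + rho) * np.log(x @ s) - np.sum(np.log(x * s))

def newton_direction(A, x, s, gamma):
    """Solve (14) through the normal equations, at a point with A x = b."""
    mu = x @ s / x.size
    D = x / s
    r = gamma * mu / s - x
    d_y = np.linalg.solve((A * D) @ A.T, -A @ r)
    d_s = -A.T @ d_y
    return r - D * d_s, d_y, d_s

def potential_reduction(A, x, y, s, eps=1e-6, max_iter=1000):
    n = x.size
    rho = np.sqrt(n)
    gamma = n / (n + rho)
    gap0 = x @ s
    fractions = np.concatenate((np.geomspace(1e-4, 1e-2, 10), np.linspace(0.02, 0.99, 98)))
    for k in range(max_iter):
        if x @ s <= eps * gap0:
            return x, y, s, k
        d_x, d_y, d_s = newton_direction(A, x, s, gamma)
        # Largest step that keeps x and s positive; some ratio is always positive here.
        alpha_max = 1.0 / np.max(np.concatenate((-d_x / x, -d_s / s)))
        alphas = alpha_max * fractions
        values = [potential(x + a * d_x, s + a * d_s, rho) for a in alphas]
        a = alphas[int(np.argmin(values))]       # grid stand-in for the line search
        x, y, s = x + a * d_x, y + a * d_y, s + a * d_s
    return x, y, s, max_iter

A = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])
b = np.array([1.0, 1.0])
x1 = (np.sqrt(5.0) - 1.0) / 2.0                  # central path point of Example 2 at mu = 1
x0 = np.array([x1, 0.5, 1 - x1, 0.5])
s0 = 1.0 / x0
y0 = np.array([-s0[2], -s0[3]])
x, y, s, iterations = potential_reduction(A, x0, y0, s0)
print(iterations, x, (x @ s) / (x0 @ s0))
```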


Iteration Complexity 


The computation of each iteration basically requires solving (14) for d. Note that 
the first equation of (14) can be written as 


Sd, + Xd, = yw1—XS1 


where X and S are two diagonal matrices whose diagonal entries are components 
of x > 0 and s > 0, respectively. Premultiplying both sides by S~! we have 


d,+S~'Xd, = ywS7'1—x. 
Then, premultiplying by A and using Ad, = 0, we have 
AS"'Xd, = ywAS"'1— Ax = ywAS"'1—b. 
Using d, = —A’d, we have 
(AS"'XA‘)d, = b— ywAS“'1. 


Thus, the primary computational cost of each iteration of the interior-point
algorithm discussed in this section is to form and invert the normal matrix $A X S^{-1} A^T$,
which typically requires $O(nm^2 + m^3)$ arithmetic operations. However, an
approximation of this matrix can be updated and inverted using far fewer arithmetic
operations. In fact, using a rank-one technique (see Chapter 10) to update the
approximate inverse of the normal matrix during the iterative progress, one can
reduce the average number of arithmetic operations per iteration to $O(\sqrt{n}\,m^2)$. Thus,
if the relative tolerance $\varepsilon$ is viewed as a variable, we have the following total
arithmetic operation complexity bound to solve a linear program:


Corollary. Let $\rho = \sqrt{n}$. Then, the algorithm above Theorem 2 terminates in
at most $O(nm^2\log(n/\varepsilon))$ arithmetic operations.

5.7 TERMINATION AND INITIALIZATION


There are several remaining important issues concerning interior-point algorithms 
for linear programs. The first issue involves termination. Unlike the simplex method 
which terminates with an exact solution, interior-point algorithms are continuous 
optimization algorithms that generate an infinite solution sequence converging to 
an optimal solution. If the data of a particular problem are integral or rational, an 
argument is made that, after the worst-case time bound, an exact solution can be 
rounded from the latest approximate solution. Several questions arise. First, under 
the real number computation model (that is, the data consists of real numbers), how 
can we terminate at an exact solution? Second, regardless of the data’s status, is 
there a practical test, which can be computed cost-effectively during the iterative 
process, to identify an exact solution so that the algorithm can be terminated before 
the worst-case time bound? Here, by exact solution we mean one that could be found
using exact arithmetic, such as the solution of a system of linear equations, which 
can be computed in a number of arithmetic operations bounded by a polynomial in n. 

The second issue involves initialization. Almost all interior-point algorithms
require the regularity assumption that $\mathcal{F}^\circ \ne \emptyset$. What is to be done if this is not true?
A related issue is that interior-point algorithms have to start at a strictly feasible
point near the central path.


Termination 


Complexity bounds for interior-point algorithms generally depend on an $\varepsilon$ which
must be zero in order to obtain an exact optimal solution. Sometimes it is
advantageous to employ an early termination or rounding method while $\varepsilon$ is still moderately
large. There are five basic approaches.


• A “purification” procedure finds a feasible corner whose objective value is at
least as good as the current interior point. This can be accomplished in strongly 
polynomial time (that is, the complexity bound is a polynomial only in the 
dimensions m and n). One difficulty is that there may be many non-optimal 
vertices close to the optimal face, and the procedure might require many pivot 
steps for difficult problems. 


• A second method seeks to identify an optimal basis. It has been shown that if the
linear program is nondegenerate, the unique optimal basis may be identified early. 
The procedure seems to work well for some problems but it has difficulty if the 
problem is degenerate. Unfortunately, most real linear programs are degenerate. 


• The third approach is to slightly perturb the data such that the new program
is nondegenerate and its optimal basis remains one of the optimal bases of the 
original program. There are questions about how and when to perturb the data 
during the iterative process, decisions which can significantly affect the success 
of the effort. 

• The fourth approach is to guess the optimal face and find a feasible solution on
that face. It consists of two phases: the first phase uses interior point algorithms to
identify the complementarity partition (P*, Z*) (see Exercise 6), and the second
phase adapts the simplex method to find an optimal primal (or dual) basic solution,
using (P*, Z*) as a starting base. This method is
often called the cross-over method. It is guaranteed to work in finite time and is
implemented in several popular linear programming software packages.


• The fifth approach is to guess the optimal face and project the current interior
point onto the interior of the optimal face. See Fig. 5.4. The termination criterion 
is guaranteed to work in finite time. 


The fourth and fifth methods above are based on the fact that (as observed in practice 
and subsequently proved) many interior-point algorithms for linear programming 
generate solution sequences that converge to a strictly complementary solution or 
an interior solution on the optimal face; see Exercise 8. 


Initialization 


Most interior-point algorithms must be initiated at a strictly feasible point. The
complexity of obtaining such an initial point is the same as that of solving the
linear program itself. More importantly, a complete algorithm should accomplish
two tasks: 1) detect the infeasibility or unboundedness status of the problem, then
2) generate an optimal solution if the problem is neither infeasible nor unbounded.
Several approaches have been proposed to accomplish these goals:


• The primal and dual can be combined into a single linear feasibility problem, and
a feasible point found. Theoretically, this approach achieves the currently best
iteration complexity bound, that is, $O(\sqrt{n}\log(1/\varepsilon))$. Practically, a significant
disadvantage of this approach is the doubled dimension of the system of equations
that must be solved at each iteration.


Fig. 5.4 Illustration of the projection of an interior point onto the optimal face
(figure labels: objective hyperplane, central path)

• The big-M method can be used by adding one or more artificial column(s) and/or
row(s) and a huge penalty parameter M to force solutions to become feasible 
during the algorithm. A major disadvantage of this approach is the numerical 
problems caused by the addition of coefficients of large magnitude. 


• Phase I-then-Phase II methods are effective. A major disadvantage of this
approach is that the two (or three) related linear programs must be solved
sequentially.


• A modified Phase I-Phase II method approaches feasibility and optimality
simultaneously. To our knowledge, the currently best iteration complexity bound of
this approach is $O(n\log(1/\varepsilon))$, as compared to $O(\sqrt{n}\log(1/\varepsilon))$ of the three
above. Other disadvantages of the method include the assumption of non-empty
interior and the need of an objective lower bound.


The HSD Algorithm 


There is an algorithm, termed the Homogeneous Self-Dual Algorithm, that overcomes
the difficulties mentioned above. The algorithm achieves the theoretically best
$O(\sqrt{n}\log(1/\varepsilon))$ complexity bound and is often used in linear programming software
packages.

The algorithm is based on the construction of a homogeneous and self-dual 
linear program related to (LP) and (LD) (see Section 5.5). We now briefly explain 
the two major concepts, homogeneity and self-duality, used in the construction. 

In general, a system of linear equations or inequalities is homogeneous if the
right hand side components are all zero. Then if a solution is found, any positive
multiple of that solution is also a solution. In the construction used below, we
allow a single inhomogeneous constraint, often called a normalizing constraint.
Karmarkar’s original canonical form is a homogeneous linear program.

A linear program is termed self-dual if the dual of the problem is equivalent to 
the primal. The advantage of self-duality is that we can apply a primal-dual interior- 
point algorithm to solve the self-dual problem without doubling the dimension of 
the linear system solved at each iteration. 

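The homogeneous and self-dual linear program (HSDP) is constructed from
(LP) and (LD) in such a way that the point $x = \mathbf{1}$, $y = 0$, $\tau = 1$, $z = 1$, $\theta = 1$ is
feasible. The primal program is

$$
\begin{array}{llcl}
(\text{HSDP}) \quad \text{minimize} & (n+1)\theta & & \\
\quad\text{subject to} & Ax - b\tau + \bar{b}\theta & = & 0, \\
 & -A^T y + c\tau - \bar{c}\theta & \ge & 0, \\
 & b^T y - c^T x + \bar{z}\theta & \ge & 0, \\
 & -\bar{b}^T y + \bar{c}^T x - \bar{z}\tau & = & -(n+1), \\
 & y \text{ free},\ x \ge 0,\ \tau \ge 0,\ \theta \text{ free}, & &
\end{array}
$$

where

$$\bar{b} = b - A\mathbf{1}, \qquad \bar{c} = c - \mathbf{1}, \qquad \bar{z} = c^T\mathbf{1} + 1. \tag{18}$$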

Notice that $\bar{b}$, $\bar{c}$, and $\bar{z}$ represent the “infeasibility” of the initial primal point, dual
point, and primal-dual “gap,” respectively. They are chosen so that the system is
feasible. For example, for the point $x = \mathbf{1}$, $y = 0$, $\tau = 1$, $\theta = 1$, the last equation
becomes

$$0 + c^T x - \mathbf{1}^T x - (c^T x + 1) = -n - 1.$$

Note also that the top two constraints in (HSDP), with $\tau = 1$ and $\theta = 0$,
represent primal and dual feasibility (with $x \ge 0$). The third equation represents
reversed weak duality (with $b^T y \ge c^T x$ rather than the usual $b^T y \le c^T x$). So if these three
equations are satisfied with $\tau = 1$ and $\theta = 0$ they define primal and dual optimal
solutions. Then, to achieve primal and dual feasibility for $x = \mathbf{1}$, $(y, s) = (0, \mathbf{1})$, we
add the artificial variable $\theta$. The fourth constraint is added to achieve self-duality.
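The claim that the data in (18) make the point $x = \mathbf{1}$, $y = 0$, $\tau = 1$, $\theta = 1$ feasible can be checked numerically. The Python sketch below is an illustration under stated assumptions: the function names and the small instance (A, b, c) are hypothetical, and the constraint rows are taken in the form $Ax - b\tau + \bar{b}\theta = 0$, $-A^T y + c\tau - \bar{c}\theta \ge 0$, $b^T y - c^T x + \bar{z}\theta \ge 0$, $-\bar{b}^T y + \bar{c}^T x - \bar{z}\tau = -(n+1)$.

```python
import numpy as np

def hsd_data(A, b, c):
    """Quantities of (18): b_bar = b - A1, c_bar = c - 1, z_bar = c'1 + 1."""
    ones = np.ones(A.shape[1])
    return b - A @ ones, c - ones, c @ ones + 1.0

def hsd_constraints(A, b, c, y, x, tau, theta):
    """Evaluate the four (HSDP) constraint rows at a given point."""
    b_bar, c_bar, z_bar = hsd_data(A, b, c)
    r1 = A @ x - b * tau + b_bar * theta       # should equal 0
    s = -A.T @ y + c * tau - c_bar * theta     # slack vector, should be >= 0
    kappa = b @ y - c @ x + z_bar * theta      # slack scalar, should be >= 0
    r4 = -b_bar @ y + c_bar @ x - z_bar * tau  # should equal -(n + 1)
    return r1, s, kappa, r4

# Hypothetical data; any A, b, c of matching sizes behaves the same way.
A = np.array([[1.0, 2.0, 1.0], [0.0, 1.0, 3.0]])
b = np.array([4.0, 5.0])
c = np.array([1.0, -2.0, 3.0])
n = A.shape[1]
r1, s, kappa, r4 = hsd_constraints(A, b, c, np.zeros(2), np.ones(n), 1.0, 1.0)
```

At this starting point the slacks come out as $s = \mathbf{1}$ and $\kappa = 1$, regardless of whether the original (LP) and (LD) are feasible.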

The problem is self-dual because its overall coefficient matrix has the property 
that its transpose is equal to its negative. It is skew-symmetric. 

Denote by s the slack vector for the second constraint and by $\kappa$ the slack
scalar for the third constraint. Denote by $\mathcal{F}_h$ the set of all points $(y, x, \tau, \theta, s, \kappa)$
that are feasible for (HSDP). Denote by $\mathcal{F}_h^\circ$ the set of strictly feasible points with
$(x, \tau, s, \kappa) > 0$ in $\mathcal{F}_h$. By combining the constraints (Exercise 14) we can write the
last (equality) constraint as

$$\mathbf{1}^T x + \mathbf{1}^T s + \tau + \kappa - (n+1)\theta = (n+1), \tag{19}$$

which serves as a normalizing constraint for (HSDP). This implies that for $0 \le \theta \le 1$
the variables in this equation are bounded.
We state without proof the following basic result. 


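Theorem 1. Consider problem (HSDP).

(i) (HSDP) has an optimal solution and its optimal solution set is bounded.
(ii) The optimal value of (HSDP) is zero, and

$$(y, x, \tau, \theta, s, \kappa) \in \mathcal{F}_h \quad \text{implies that} \quad (n+1)\theta = x^T s + \tau\kappa.$$

(iii) There is an optimal solution $(y^*, x^*, \tau^*, \theta^* = 0, s^*, \kappa^*) \in \mathcal{F}_h$ such that

$$\begin{pmatrix} x^* + s^* \\ \tau^* + \kappa^* \end{pmatrix} > 0,$$

which we call a strictly self-complementary solution.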


Part (ii) of the theorem shows that as $\theta$ goes to zero, the solution tends toward
satisfying complementary slackness between x and s and between $\tau$ and $\kappa$. Part
(iii) shows that at a solution with $\theta = 0$, the complementary slackness is strict in
the sense that at least one member of a complementary pair must be positive. For
example, $x_1 s_1 = 0$ is required by complementary slackness, but in this case $x_1 = 0$,
$s_1 = 0$ will not occur; exactly one of them must be positive.

We now relate optimal solutions to (HSDP) to those for (LP) and (LD). 

Theorem 2. Let $(y^*, x^*, \tau^*, \theta^* = 0, s^*, \kappa^*)$ be a strictly self-complementary
solution for (HSDP).

(i) (LP) has a solution (feasible and bounded) if and only if $\tau^* > 0$. In this
case, $x^*/\tau^*$ is an optimal solution for (LP) and $y^*/\tau^*, s^*/\tau^*$ is an optimal
solution for (LD).

(ii) (LP) has no solution if and only if $\kappa^* > 0$. In this case, $x^*/\kappa^*$ or $y^*/\kappa^*$
or both are certificates for proving infeasibility: if $c^T x^* < 0$ then (LD) is
infeasible; if $-b^T y^* < 0$ then (LP) is infeasible; and if both $c^T x^* < 0$ and
$-b^T y^* < 0$ then both (LP) and (LD) are infeasible.


Proof. We prove the second statement. We first assume that one of (LP) and
(LD) is infeasible, say (LD) is infeasible. Then there is some certificate $\bar{x} \ge 0$ such
that $A\bar{x} = 0$ and $c^T\bar{x} = -1$. Let $(\bar{y} = 0, \bar{s} = 0)$ and

$$\alpha = \frac{n+1}{\mathbf{1}^T\bar{x} + \mathbf{1}^T\bar{s} + 1} > 0.$$

Then one can verify that

$$\tilde{y}^* = \alpha\bar{y}, \quad \tilde{x}^* = \alpha\bar{x}, \quad \tilde{\tau}^* = 0, \quad \tilde{\theta}^* = 0, \quad \tilde{s}^* = \alpha\bar{s}, \quad \tilde{\kappa}^* = \alpha$$

is a self-complementary solution for (HSDP). Since the supporting set (the set of
positive entries) of a strictly complementary solution for (HSDP) is unique (see
Exercise 6), $\kappa^* > 0$ at any strictly complementary solution for (HSDP).

Conversely, if $\tau^* = 0$, then $\kappa^* > 0$, which implies that $c^T x^* - b^T y^* < 0$, i.e.,
at least one of $c^T x^*$ and $-b^T y^*$ is strictly less than zero. Let us say $c^T x^* < 0$. In
addition, we have

$$Ax^* = 0, \quad A^T y^* + s^* = 0, \quad (x^*)^T s^* = 0 \quad \text{and} \quad x^* + s^* > 0.$$

From Farkas’ lemma (Exercise 5), $x^*/\kappa^*$ is a certificate for proving dual
infeasibility. The other cases hold similarly. ∎


To solve (HSDP), we have the following theorem, which resembles the central
path analyzed for (LP) and (LD).


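Theorem 3. Consider problem (HSDP). For any $\mu > 0$, there is a unique
$(y, x, \tau, \theta, s, \kappa)$ in $\mathcal{F}_h^\circ$ such that

$$\begin{pmatrix} x \circ s \\ \tau\kappa \end{pmatrix} = \mu\mathbf{1}.$$

Moreover, $(x, \tau) = (\mathbf{1}, 1)$, $(y, s, \kappa) = (0, \mathbf{1}, 1)$ and $\theta = 1$ is the solution with
$\mu = 1$.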

Theorem 3 defines an endogenous path associated with (HSDP):

$$\mathcal{C} = \left\{ (y, x, \tau, \theta, s, \kappa) \in \mathcal{F}_h^\circ : \begin{pmatrix} x \circ s \\ \tau\kappa \end{pmatrix} = \frac{x^T s + \tau\kappa}{n+1}\,\mathbf{1} \right\}.$$

Furthermore, the potential function for (HSDP) can be defined as

$$\psi_{n+1+\rho}(x, \tau, s, \kappa) = (n+1+\rho)\log(x^T s + \tau\kappa) - \sum_{j=1}^{n}\log(x_j s_j) - \log(\tau\kappa), \tag{20}$$

where $\rho \ge 0$. One can then apply the interior-point algorithms described earlier to
solve (HSDP) from the initial point $(x, \tau) = (\mathbf{1}, 1)$, $(y, s, \kappa) = (0, \mathbf{1}, 1)$ and $\theta = 1$
with $\mu = (x^T s + \tau\kappa)/(n+1) = 1$.
The HSDP method outlined above enjoys the following properties: 
• It does not require regularity assumptions concerning the existence of optimal,
feasible, or interior feasible solutions.

• It can be initiated at $x = \mathbf{1}$, $y = 0$ and $s = \mathbf{1}$, feasible or infeasible, on the
central ray of the positive orthant (cone), and it does not require a big-M penalty
parameter or lower bound.

• Each iteration solves a system of linear equations whose dimension is almost the
same as that used in the standard (primal-dual) interior-point algorithms.

• If the linear program has a solution, the algorithm generates a sequence that
approaches feasibility and optimality simultaneously; if the problem is infeasible
or unbounded, the algorithm produces an infeasibility certificate for at least one
of the primal and dual problems; see Exercise 5.


5.8 SUMMARY 


The simplex method has for decades been an efficient method for solving linear 
programs, despite the fact that there are no theoretical results to support its 
efficiency. Indeed, it was shown that in the worst case, the method may visit every 
vertex of the feasible region and this can be exponential in the number of variables 
and constraints. If on practical problems the simplex method behaved according 
to the worst case, even modest problems would require years of computer time 
to solve. The ellipsoid method was the first method that was proved to converge 
in time proportional to a polynomial in the size of the program, rather than to 
an exponential in the size. However, in practice, it was disappointingly slower
than the simplex method. Later, the interior-point method of Karmarkar significantly
advanced the field of linear programming, for it not only was proved to be a
polynomial-time method, but it was found in practice to be faster than the simplex 
method when applied to general linear programs. 

The interior-point method is based on introducing a logarithmic barrier function
with a weighting parameter $\mu$; and now there is a general theoretical structure
defining the analytic center, the central path of solutions as $\mu \to 0$, and the duals
of these concepts. This structure is useful for specifying and analyzing various
versions of interior point methods.

Most methods employ a step of Newton’s method to find a point near the 
central path when moving from one value of $\mu$ to another. One approach is the
predictor-corrector method, which first takes a step in the direction of decreasing $\mu$
and then a corrector step to get closer to the central path. Another method employs 
a potential function whose value can be decreased at each step, which guarantees 
convergence and assures that intermediate points simultaneously make progress 
toward the solution while remaining close to the central path. 

Complete algorithms based on these approaches require a number of other 
features and details. For example, once systematic movement toward the solution 
is terminated, a final phase may move to a nearby vertex or to a non-vertex point 
on a face of the constraint set. Also, an initial phase must be employed to obtain 
a feasible point that is close to the central path from which the steps of the search
algorithm can be started. These features are incorporated into several commercial 
software packages, and generally they perform well, able to solve very large linear 
programs in reasonable time. 


5.9 EXERCISES 


1. Using the simplex method, solve the program (1) and count the number of pivots required. 

2. Prove the volume reduction rate in Theorem 1 for the ellipsoid method. 

3. Develop a cutting plane method, based on the ellipsoid method, to find a point satisfying
convex inequalities

$$f_i(x) \le 0, \quad i = 1, \ldots, m,$$

where the $f_i$’s are convex functions of x in $C^1$.


4. Consider the linear program (5) and assume that $\mathcal{F}_p = \{x : Ax = b,\ x \ge 0\}$ is nonempty
and its optimal solution set is bounded. Show that the dual of the problem has a nonempty
interior.


5. (Farkas’ lemma) Prove: Exactly one of the feasible sets $\{x : Ax = b,\ x \ge 0\}$ and
$\{y : y^T A \le 0,\ y^T b = 1\}$ is nonempty. A vector y in the latter set is called an infeasibility
certificate for the former.


6. (Strict complementarity) Consider any linear program in standard form and its dual and
let both of them be feasible. Then, there always exists a strictly complementary solution
pair, $(x^*, y^*, s^*)$, such that

$$x_j^* s_j^* = 0 \quad \text{and} \quad x_j^* + s_j^* > 0 \quad \text{for all } j.$$

Moreover, the supports of $x^*$ and $s^*$, $P^* = \{j : x_j^* > 0\}$ and $Z^* = \{j : s_j^* > 0\}$, are invariant
among all strictly complementary solution pairs.


7. (Central path theorem) Let $(x(\mu), y(\mu), s(\mu))$ be the central path of (9). Then prove

(a) The central path point $(x(\mu), y(\mu), s(\mu))$ is bounded for $0 < \mu \le \mu^0$ and any given
$0 < \mu^0 < \infty$.
(b) For $0 < \mu' < \mu$,
$$c^T x(\mu') \le c^T x(\mu) \quad \text{and} \quad b^T y(\mu') \ge b^T y(\mu).$$
Furthermore, if $x(\mu') \ne x(\mu)$ and $y(\mu') \ne y(\mu)$,
$$c^T x(\mu') < c^T x(\mu) \quad \text{and} \quad b^T y(\mu') > b^T y(\mu).$$
(c) $(x(\mu), y(\mu), s(\mu))$ converges to an optimal solution pair for (LP) and (LD).
Moreover, the limit point $x(0)_{P^*}$ is the analytic center on the primal optimal face,
and the limit point $s(0)_{Z^*}$ is the analytic center on the dual optimal face, where
$(P^*, Z^*)$ is the strict complementarity partition of the index set $\{1, 2, \ldots, n\}$.


8. Consider a primal-dual interior point $(x, y, s) \in \mathcal{N}(\eta)$ where $\eta < 1$. Prove that there is a
fixed quantity $\delta > 0$ such that

$$x_j \ge \delta \quad \text{for all } j \in P^*$$

and

$$s_j \ge \delta \quad \text{for all } j \in Z^*,$$

where $(P^*, Z^*)$ is defined in Exercise 6.

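9. (Potential level theorem) Define the potential level set

$$\Psi(\delta) = \{(x, y, s) \in \mathcal{F}^\circ : \psi_{n+\rho}(x, s) \le \delta\}.$$

Prove
(a) $\Psi(\delta^1) \subset \Psi(\delta^2)$ if $\delta^1 \le \delta^2$.

(b) For every $\delta$, $\Psi(\delta)$ is bounded and its closure $\bar{\Psi}(\delta)$ has non-empty intersection with
the solution set.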


10. Given $0 < x$, $0 < s \in E^n$, show that

$$n\log(x^T s) - \sum_{j=1}^{n}\log(x_j s_j) \ge n\log n$$

and

$$x^T s \le \exp\left[\frac{\psi_{n+\rho}(x, s) - n\log n}{\rho}\right].$$

11. (Logarithmic approximation) If $d \in E^n$ such that $|d|_\infty < 1$ then

$$\mathbf{1}^T d \ \ge\ \sum_{i=1}^{n}\log(1 + d_i) \ \ge\ \mathbf{1}^T d - \frac{|d|^2}{2(1 - |d|_\infty)}.$$

[Note: If $d = (d_1, d_2, \ldots, d_n)$ then $|d|_\infty = \max_i |d_i|$.]


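12. Let the direction $(d_x, d_y, d_s)$ be generated by system (14) with $\gamma = n/(n+\rho)$ and
$\mu = x^T s/n$, and let the step size be

$$\alpha = \frac{\theta\sqrt{\min(Xs)}}{\left|(XS)^{-1/2}\left(\dfrac{x^T s}{n+\rho}\,\mathbf{1} - Xs\right)\right|}, \tag{21}$$

where $\theta$ is a positive constant less than 1. Let

$$x^+ = x + \alpha d_x, \quad y^+ = y + \alpha d_y, \quad \text{and} \quad s^+ = s + \alpha d_s.$$

Then, using Exercise 11 and the concavity of the logarithmic function show
$(x^+, y^+, s^+) \in \mathcal{F}^\circ$ and

$$\psi_{n+\rho}(x^+, s^+) - \psi_{n+\rho}(x, s) \le -\theta\sqrt{\min(Xs)}\left|(XS)^{-1/2}\left(\mathbf{1} - \frac{n+\rho}{x^T s}\,Xs\right)\right| + \frac{\theta^2}{2(1-\theta)}.$$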


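13. Let $v = Xs$ in Exercise 12. Prove

$$\sqrt{\min(v)}\left|V^{-1/2}\left(\mathbf{1} - \frac{n+\rho}{\mathbf{1}^T v}\,v\right)\right| \ge \sqrt{3/4},$$

where V is the diagonal matrix of v. Thus, the two exercises imply

$$\psi_{n+\rho}(x^+, s^+) - \psi_{n+\rho}(x, s) \le -\theta\sqrt{3/4} + \frac{\theta^2}{2(1-\theta)} = -\delta$$

for a constant $\delta$. One can verify that $\delta > 0.2$ when $\theta = 0.4$.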


14. Prove property (19) for (HSDP).


PART II
UNCONSTRAINED
PROBLEMS


Chapter 6 TRANSPORTATION
AND NETWORK
FLOW PROBLEMS


There are a number of problems of special structure that are important components 
of the subject of linear programming. A broad class of such special problems 
is represented by the transportation problem and related problems treated in the 
first five sections of this chapter, and network flow problems treated in the last 
three sections. These problems are important because, first, they represent broad 
areas of applications that arise frequently. Indeed, many of these problems were 
originally formulated prior to the general development of linear programming, and 
they continue to arise in a variety of applications. Second, these problems are 
important because of their associated rich theory, which provides important insight 
and suggests new general developments. 

The chapter is roughly divided into two parts. In the first part the transportation 
problem is examined from the viewpoint of the revised simplex method, which 
takes an extremely simple form for this problem. The second part of the chapter 
introduces graphs and network flows. The transportation algorithm is generalized 
and given new interpretations. Next, a special, highly efficient algorithm, the tree 
algorithm, is developed for solution of the maximal flow problem. 


6.1 THE TRANSPORTATION PROBLEM 


The transportation problem was stated briefly in Chapter 2. We restate it here. There
are m origins that contain various amounts of a commodity that must be shipped
to n destinations to meet demand requirements. Specifically, origin i contains an
amount $a_i$, and destination j has a requirement of amount $b_j$. It is assumed that the
system is balanced in the sense that total supply equals total demand. That is,

$$\sum_{i=1}^{m} a_i = \sum_{j=1}^{n} b_j. \tag{1}$$

The numbers $a_i$ and $b_j$, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$, are assumed to be
nonnegative, and in many applications they are in fact nonnegative integers. There is a unit
cost $c_{ij}$ associated with the shipping of the commodity from origin i to destination
j. The problem is to find the shipping pattern between origins and destinations that
satisfies all the requirements and minimizes the total shipping cost.

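In mathematical terms the above problem can be expressed as finding a set of
$x_{ij}$’s, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$, to

$$
\begin{array}{lll}
\text{minimize} & \displaystyle\sum_{i=1}^{m}\sum_{j=1}^{n} c_{ij}x_{ij} & \\
\text{subject to} & \displaystyle\sum_{j=1}^{n} x_{ij} = a_i & \text{for } i = 1, 2, \ldots, m \\
 & \displaystyle\sum_{i=1}^{m} x_{ij} = b_j & \text{for } j = 1, 2, \ldots, n \\
 & x_{ij} \ge 0 & \text{for all } i \text{ and } j.
\end{array}
\tag{2}
$$

This mathematical problem, together with the assumption (1), is the general
transportation problem. In the shipping context, the variables $x_{ij}$ represent the amounts
of the commodity shipped from origin i to destination j.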

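The structure of the problem can be seen more clearly by writing the constraint
equations in standard form:

$$
\begin{array}{rcl}
x_{11} + x_{12} + \cdots + x_{1n} & = & a_1 \\
x_{21} + x_{22} + \cdots + x_{2n} & = & a_2 \\
 & \vdots & \\
x_{m1} + x_{m2} + \cdots + x_{mn} & = & a_m \\
x_{11} + x_{21} + \cdots + x_{m1} & = & b_1 \\
x_{12} + x_{22} + \cdots + x_{m2} & = & b_2 \\
 & \vdots & \\
x_{1n} + x_{2n} + \cdots + x_{mn} & = & b_n
\end{array}
\tag{3}
$$

The structure is perhaps even more evident when the coefficient matrix A of the
system of equations above is expressed in vector-matrix notation as

$$
A = \begin{bmatrix}
\mathbf{1}^T & & & \\
 & \mathbf{1}^T & & \\
 & & \ddots & \\
 & & & \mathbf{1}^T \\
I & I & \cdots & I
\end{bmatrix},
\tag{4}
$$

where $\mathbf{1} = (1, 1, \ldots, 1)$ is n-dimensional, and where each I is an n × n identity
matrix.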

In practice it is usually unnecessary to write out the constraint equations of the
transportation problem in the explicit form (3). A specific transportation problem
is generally defined by simply presenting the data in compact form, such as:

$$
a = (a_1, a_2, \ldots, a_m), \qquad
b = (b_1, b_2, \ldots, b_n), \qquad
C = \begin{bmatrix}
c_{11} & c_{12} & \cdots & c_{1n} \\
c_{21} & c_{22} & \cdots & c_{2n} \\
\vdots & & & \vdots \\
c_{m1} & c_{m2} & \cdots & c_{mn}
\end{bmatrix}.
$$

The solution can also be represented by an m × n array, and as we shall see, all
computations can be made on arrays of a similar dimension.


Example 1. As an example, which will be solved completely in a later section, a
specific transportation problem with four origins and five destinations is defined by

    a = (30, 80, 10, 60)
    b = (10, 50, 20, 80, 20)

        | 3  4  6  8  9 |
    C = | 2  2  4  5  3 |
        | (third row illegible in the source) |
        | 3  3  2  4  2 |

Note that the balance requirement is satisfied, since total supply and total
demand are both 180.


Feasibility and Redundancy 


A first step in the study of the structure of the transportation problem is to show 
that there is always a feasible solution, thus establishing that the problem is well 
defined. A feasible solution can be found by allocating shipments from origins to 
destinations in proportion to supply and demand requirements. Specifically, let S 
be equal to the total supply (which is also equal to the total demand). Then let 
$x_{ij} = a_i b_j / S$ for $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$. The reader can easily verify that
this is a feasible solution. We also note that the solutions are bounded, since each
$x_{ij}$ is bounded by $a_i$ (and by $b_j$). A bounded program with a feasible solution has
an optimal solution. Thus, a transportation problem always has an optimal solution. 
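The proportional allocation can be checked on the supply and demand data of Example 1. The short Python sketch below is illustrative only; the function name `proportional_solution` is hypothetical.

```python
import numpy as np

def proportional_solution(a, b):
    """Feasible (generally non-basic) solution x_ij = a_i * b_j / S."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    S = a.sum()
    if not np.isclose(S, b.sum()):
        raise ValueError("total supply must equal total demand")
    return np.outer(a, b) / S

a = [30, 80, 10, 60]          # supplies of Example 1
b = [10, 50, 20, 80, 20]      # demands of Example 1
X = proportional_solution(a, b)
row_sums = X.sum(axis=1)      # should reproduce a
col_sums = X.sum(axis=0)      # should reproduce b
```

Every entry of this solution is positive, so it uses all mn variables rather than the m + n − 1 of a basic solution.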

A second step in the study of the structure of the transportation problem is based
on a simple examination of the constraint equations. Clearly there are m equations
corresponding to origin constraints and n equations corresponding to destination
constraints, for a total of n + m. However, it is easily noted that the sum of the origin
equations is

$$\sum_{i=1}^{m}\sum_{j=1}^{n} x_{ij} = \sum_{i=1}^{m} a_i, \tag{5}$$

and the sum of the destination equations is

$$\sum_{j=1}^{n}\sum_{i=1}^{m} x_{ij} = \sum_{j=1}^{n} b_j. \tag{6}$$

The left-hand sides of these equations are equal. Since they were formed by two
distinct linear combinations of the original equations, it follows that the equations 
in the original system are not independent. The right-hand sides of (5) and (6) 
are equal by the assumption that the system is balanced, and therefore the two 
equations are, in fact, consistent. However, it is clear that the original system of 
equations is redundant. This means that one of the constraints can be eliminated 
without changing the set of feasible solutions. Indeed, any one of the constraints 
can be chosen as the one to be eliminated, for it can be reconstructed from those 
remaining. The above observations are summarized and slightly extended in the 
following theorem. 


Theorem. A transportation problem always has a solution, but there is exactly
one redundant equality constraint. When any one of the equality constraints
is dropped, the remaining system of n + m − 1 equality constraints is linearly
independent.


Proof. The existence of a solution and a redundancy were established above. The 
sum of all origin constraints minus the sum of all destination constraints is identically 
zero. It follows that any constraint can be expressed as a linear combination of the 
others, and hence any one constraint can be dropped. 

Suppose that one equation is dropped, say the last one. Suppose that there were
a linear combination of the remaining equations that was identically zero. Let the
coefficients of such a combination be $\alpha_i$, $i = 1, 2, \ldots, m$, and $\beta_j$, $j = 1, 2, \ldots, n-1$.
Referring to (3), it is seen that each $x_{in}$, $i = 1, 2, \ldots, m$, appears only in the ith
equation (since the last one has been dropped). Thus $\alpha_i = 0$ for $i = 1, 2, \ldots, m$.
In the remaining equations $x_{ij}$ appears in only one equation, and hence $\beta_j = 0$,
$j = 1, 2, \ldots, n-1$. Hence the only linear combination that yields zero is the zero
combination, and therefore the system of equations is linearly independent. ∎


It follows from the above discussion that a basis for the transportation problem
consists of m + n − 1 vectors, and a nondegenerate basic feasible solution consists
of m + n − 1 variables. The simple solution found earlier in this section is clearly
not a basic solution.
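The rank statement in the theorem can be confirmed numerically for a particular size. The sketch below builds the constraint matrix with the variables ordered $x_{11}, \ldots, x_{1n}, x_{21}, \ldots, x_{mn}$; the function name `transportation_matrix` and the choice m = 4, n = 5 are illustrative assumptions.

```python
import numpy as np

def transportation_matrix(m, n):
    """Coefficient matrix of the m origin and n destination constraints."""
    origin = np.kron(np.eye(m), np.ones((1, n)))       # block-diagonal rows of ones
    destination = np.kron(np.ones((1, m)), np.eye(n))  # [I I ... I]
    return np.vstack([origin, destination])

m, n = 4, 5
A = transportation_matrix(m, n)
full_rank = np.linalg.matrix_rank(A)
# Rank after dropping each single constraint in turn.
reduced_ranks = [np.linalg.matrix_rank(np.delete(A, r, axis=0))
                 for r in range(m + n)]
```

A numerical rank computation is of course only a spot check for one pair (m, n), not a substitute for the proof.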


6.2 FINDING A BASIC FEASIBLE SOLUTION 


There is a straightforward way to compute an initial basic feasible solution to a trans- 
portation problem. The method is worth studying at this stage because it introduces 
the computational process that is the foundation for the general solution technique 
based on the simplex method. It also begins to illustrate the fundamental property of 
the structure of transportation problems that is discussed in the next section. 


The Northwest Corner Rule 


This procedure is conducted on the solution array shown below: 

    x_11 | x_12 | x_13 | ... | x_1n | a_1
    x_21 | x_22 | x_23 | ... | x_2n | a_2
     ...
    x_m1 | x_m2 | x_m3 | ... | x_mn | a_m
    b_1  | b_2  | b_3  | ... | b_n  |          (7)


The individual elements of the array appear in cells and represent a solution. An 
empty cell denotes a value of zero. 
Beginning with all empty cells, the procedure is given by the following steps: 


Step 1. Start with the cell in the upper left-hand corner.


Step 2. Allocate the maximum feasible amount consistent with row and column 
sum requirements involving that cell. (At least one of these requirements will then 
be met.) 


Step 3. Move one cell to the right if there is any remaining row requirement 
(supply). Otherwise move one cell down. If all requirements are met, stop; otherwise 
go to Step 2. 


The procedure is called the Northwest Corner Rule because at each step it 
selects the cell in the upper left-hand corner of the subarray consisting of current 
nonzero row and column requirements. 


Example 1. A basic feasible solution constructed by the Northwest Corner Rule 
is shown below for Example 1 of the last section. 


    10 | 20 |    |    |    | 30
       | 30 | 20 | 30 |    | 80
       |    |    | 10 |    | 10          (8)
       |    |    | 40 | 20 | 60
    10 | 50 | 20 | 80 | 20 |


In the first step, at the upper left-hand corner, a maximum of 10 units could be 
allocated, since that is all that was required by column 1. This left 30 — 10 = 20 
units required in the first row. Next, moving to the second cell in the top row, the 
remaining 20 units were allocated. At this point the row 1 requirement is met, and it
is necessary to move down to the second row. The reader should be able to follow 
the remaining steps easily. 

There is the possibility that at some point both the row and column requirements 
corresponding to a cell may be met. The next entry will then be a zero, indicating a 
degenerate basic solution. In such a case there is a choice as to where to place the 
zero. One can either move right or move down to enter the zero. Two examples of 
degenerate solutions to a problem are shown below: 

    30 |    |    |    | 30          30 |    |    |    | 30
    20 | 20 |    |    | 40          20 | 20 |  0 |    | 40
       |  0 | 20 |    | 20             |    | 20 |    | 20
       |    | 20 | 40 | 60             |    | 20 | 40 | 60
    50 | 20 | 40 | 40 |             50 | 20 | 40 | 40 |


It should be clear that the Northwest Corner Rule can be used to obtain different 
basic feasible solutions by first permuting the rows and columns of the array before 
the procedure is applied. Or equivalently, one can do this indirectly by starting the 
procedure at an arbitrary cell and then considering successive rows and columns in 
an arbitrary order. 
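The Northwest Corner Rule is short enough to state as code. The Python sketch below is one possible rendering, with a hypothetical function name; in the degenerate case, where a row and column requirement are met simultaneously, it arbitrarily chooses to move down, which is one of the two choices described above.

```python
def northwest_corner(a, b):
    """Northwest Corner Rule for a balanced problem.

    Returns the solution array and the list of basic cells (0-based).
    """
    a, b = list(a), list(b)          # remaining row / column requirements
    if sum(a) != sum(b):
        raise ValueError("problem must be balanced")
    m, n = len(a), len(b)
    x = [[0] * n for _ in range(m)]
    basic = []
    i = j = 0
    while True:
        q = min(a[i], b[j])          # maximum feasible allocation for this cell
        x[i][j] = q
        basic.append((i, j))
        a[i] -= q
        b[j] -= q
        if i == m - 1 and j == n - 1:
            return x, basic
        if j == n - 1:               # no column to the right
            i += 1
        elif i == m - 1 or a[i] > 0: # remaining row requirement: move right
            j += 1
        else:                        # row met (possibly column too): move down
            i += 1

# Data of Example 1 and of the degenerate illustration above.
x1, basic1 = northwest_corner([30, 80, 10, 60], [10, 50, 20, 80, 20])
x2, basic2 = northwest_corner([30, 40, 20, 60], [50, 20, 40, 40])
```

Each call visits exactly m + n − 1 cells, which matches the number of basic variables; in the degenerate case one of the visited cells receives the value zero.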


6.3 BASIS TRIANGULARITY 


We now establish the most important structural property of the transportation 
problem: the triangularity of all bases. This property simplifies the process of 
solution of a system of equations whose coefficient matrix corresponds to a basis, 
and thus leads to efficient implementation of the simplex method. 


Triangular Matrices 


The concept of upper and lower triangular matrices was introduced earlier 
in Section 3.8 in connection with Gaussian elimination methods. (Also see 
Appendix C.) It is useful at this point to generalize slightly the notion of upper and 
lower triangularity. 


Definition. A nonsingular square matrix M is said to be triangular if by a 
permutation of its rows and columns it can be put in the form of a lower 
triangular matrix. 


Clearly a nonsingular lower triangular matrix is triangular according to the 
above definition. A nonsingular upper triangular matrix is also triangular, since by 
reversing the order of its rows and columns it becomes lower triangular. 

There is a simple and useful procedure for determining whether a given matrix 
M is triangular: 


Step 1. Find a row with exactly one nonzero entry. 


Step 2. Form a submatrix of the matrix used in Step 1 by crossing out the row 
found in Step 1 and the column corresponding to the nonzero entry in that row. 
Return to Step 1 with this submatrix. 


If this procedure can be continued until all rows have been eliminated, then the 
matrix is triangular. It can be put in lower triangular form explicitly by arranging 
the rows and columns in the order that was determined by the procedure. 


Example 1. Shown below on the left is a matrix before the above procedure is 
applied to it. Indicated along the edges of this matrix is the order in which the rows 
and columns are indexed according to the procedure. Shown at the right is the same 
matrix when its rows and columns are permuted according to the order found. 


 1  2  0  1  0  2 | 4          4  0  0  0  0  0 
 4  1  0  5  0  0 | 3          1  2  0  0  0  0 
 0  0  0  4  0  0 | 1          5  1  4  0  0  0 
 2  1  7  2  1  3 | 6          1  2  1  2  0  0 
 2  3  2  0  0  3 | 5          0  3  2  3  2  0 
 0  2  0  1  0  0 | 2          2  1  2  3  7  1 
 3  2  5  1  6  4 
Triangularization 


The importance of triangularity is, of course, the associated method of back 
substitution for the solution of a triangular system of equations. Suppose that 
M is triangular. A permutation of rows is simply a reordering of the equations, 
and a permutation of columns is simply a reordering of the variables. So after 
appropriate reordering, the system of equations Mx = d takes a lower triangular 
form and it can be solved in the familiar way: by first solving for $x_1$ from the first 
equation, then substituting this value into the second equation to solve for $x_2$, and 
so forth. 

This method also applies to systems of the form $\lambda^T M = c^T$. In this case the 
components of $\lambda$ will be determined in reverse order, starting with $\lambda_n$. This is 
because the system, when written in standard column form, has coefficient matrix 
$M^T$, which is upper triangular. The upper triangular form corresponds to that 
obtained by standard Gaussian elimination applied to an arbitrary system, and this 
accounts for the terminology “back substitution” as discussed in Appendix C. 
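The two-step test for triangularity and the substitution method that follows from it can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `triangular_order` and `solve_by_substitution` are invented here, indices are 0-based, and exact rational arithmetic is used merely to keep the check free of rounding.

```python
from fractions import Fraction

def triangular_order(M):
    """Return (row_order, col_order) that makes M lower triangular, or None."""
    n = len(M)
    rows, cols = set(range(n)), set(range(n))
    row_order, col_order = [], []
    while rows:
        for i in sorted(rows):
            nonzero = [j for j in cols if M[i][j] != 0]
            if len(nonzero) == 1:            # Step 1: row with exactly one nonzero entry
                row_order.append(i)
                col_order.append(nonzero[0])
                rows.remove(i)               # Step 2: cross out that row and column
                cols.remove(nonzero[0])
                break
        else:
            return None                      # the procedure is stuck: not triangular
    return row_order, col_order

def solve_by_substitution(M, d):
    """Solve M x = d for a triangular M, one new unknown per equation."""
    order = triangular_order(M)
    if order is None:
        raise ValueError("matrix is not triangular")
    x = [Fraction(0)] * len(M)
    for i, j in zip(*order):
        # every other nonzero entry of row i multiplies an already-computed unknown
        known = sum(M[i][k] * x[k] for k in range(len(M)) if k != j)
        x[j] = Fraction(d[i] - known) / M[i][j]
    return x

# Matrix of Example 1 (0-based indices in the code, 1-based in the text).
M = [[1, 2, 0, 1, 0, 2],
     [4, 1, 0, 5, 0, 0],
     [0, 0, 0, 4, 0, 0],
     [2, 1, 7, 2, 1, 3],
     [2, 3, 2, 0, 0, 3],
     [0, 2, 0, 1, 0, 0]]
rows, cols = triangular_order(M)
print([r + 1 for r in rows])   # order in which the rows are eliminated
print([c + 1 for c in cols])   # matching order of the columns
```

For the matrix of Example 1 the rows are eliminated in the order 3, 6, 2, 1, 5, 4 and the columns in the order 4, 2, 1, 6, 3, 5, which agrees with the indices shown along the edges of that matrix.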


Triangular Bases 


We are now prepared to derive the most important structural property of the 
transportation problem. 


Basis Triangularity Theorem. Every basis of the transportation problem is 
triangular. 


Proof. Refer to the system of constraints (3). Let us change the sign of the top 
half of the system; then the coefficient matrix of the system consists of entries that 
are either $+1$, $-1$, or $0$. Following the result of the theorem in Section 6.1, delete 
any one of the equations to eliminate the redundancy. From the resulting coefficient 
matrix, form a basis $B$ by selecting a nonsingular subset of $m+n-1$ columns. 
Each column of $B$ contains at most two nonzero entries, a $+1$ and a $-1$. Thus 
there are at most $2(m+n-1)$ nonzero entries in the basis. However, if every 
column contained two nonzero entries, then the sum of all rows would be zero, 
contradicting the nonsingularity of $B$. Thus at least one column of $B$ must contain 
only one nonzero entry. This means that the total number of nonzero entries in $B$ 
is less than $2(m+n-1)$. It then follows that there must be a row with only one 
nonzero entry; for if every row had two or more nonzero entries, the total number 
would be at least $2(m+n-1)$. This means that the first step of the procedure 
for verifying triangularity is satisfied. A similar argument can be applied to the 
submatrix of $B$ obtained by crossing out the row with the single nonzero entry and 
the column corresponding to that entry; that submatrix must also contain a row with 
a single nonzero entry. This argument can be continued, establishing that the basis 
$B$ is triangular. ∎ 


Example 2. As an illustration of the Basis Triangularity Theorem, consider the 
basis selected by the Northwest Corner Rule in Example 1 of Section 6.2. This 
basis is represented below, except that only the basic variables are indicated, not 
their values. 


  x |  x |    |    |    | 30 
    |  x |  x |  x |    | 80 
    |    |    |  x |    | 10 
    |    |    |  x |  x | 60 
 10 | 50 | 20 | 80 | 20 | 


A row in a basis matrix corresponds to an equation in the original system and is 
associated with a constraint either on a row or column sum in the solution array. In 
this example the equation corresponding to the first column sum contains only one 
basis variable, $x_{11}$. The value of this variable can be found immediately to be 10. 
The next equation corresponds to the first row sum. The corresponding variable is 
$x_{12}$, which can be found to be 20, since $x_{11}$ is known. Progression in this manner 
through the basis variables is equivalent to back substitution. 
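The row and column scanning of Example 2 can be written out as a short Python sketch. The function name `basic_solution` and the 0-based cell indices are assumptions made for this illustration; the data are the row and column sums and the Northwest Corner basis of the example.

```python
def basic_solution(basic_cells, row_sums, col_sums):
    """Values of the basic variables, found by row and column scanning."""
    a, b = list(row_sums), list(col_sums)       # requirements still to be met
    left = set(basic_cells)                     # basic cells not yet determined
    x = {}
    while left:
        for (i, j) in sorted(left):
            row_alone = sum(1 for (p, q) in left if p == i) == 1
            col_alone = sum(1 for (p, q) in left if q == j) == 1
            if row_alone or col_alone:
                # the equation has a single unknown: it takes the whole remainder
                x[(i, j)] = a[i] if row_alone else b[j]
                a[i] -= x[(i, j)]
                b[j] -= x[(i, j)]
                left.remove((i, j))
                break
        else:
            raise ValueError("cells do not form a basis")
    return x

row_sums = [30, 80, 10, 60]
col_sums = [10, 50, 20, 80, 20]
basis = [(0, 0), (0, 1), (1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (3, 4)]
values = basic_solution(basis, row_sums, col_sums)
for cell in basis:
    print(cell, values[cell])
```

Only additions and subtractions of the given row and column sums occur, so integer requirements lead to integer values of the basic variables.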


Example 3. Represented below is another basis for Example 2. We must scan 
the rows and columns to find one with a single basic variable. The value of this 
variable can be easily found. Such a row or column always exists, since every basis 
is triangular. Then this row or column is crossed out and the procedure repeated. 
The numbers in the cells indicate an acceptable order of computation, although 
there are several others. 


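[Solution array not recoverable from the source: a 4 x 5 array with row sums 30, 80, 10, 60 and column sums 10, 50, 20, 80, 20, in which the eight basic cells are marked x and numbered 1 through 8 in an acceptable order of computation.] 


Integer Solutions 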


Since any basis matrix is triangular and all nonzero elements are equal to one (or 
minus one if the signs of some equations are changed), it follows that the process 
of back substitution will simply involve repeated additions and subtractions of the 
given row and column sums. No multiplication is required. It therefore follows that 
if the original row and column totals are integers, the values of all basic variables 
will be integers. This is an important result, which we summarize by a corollary to 
the Basis Triangularity Theorem. 


Corollary. If the row and column sums of a transportation problem are 
integers, then the basic variables in any basic solution are integers. 


6.4 SIMPLEX METHOD FOR TRANSPORTATION 
PROBLEMS 


Now that the structural properties of the transportation problem have been 
developed, it is a relatively straightforward task to work out the details of the 
simplex method for the transportation problem. A major objective is to exploit fully 
the triangularity property of bases in order to achieve both computational efficiency 
and a compact representation of the method. The method used is actually a direct 
adaptation of the version of the revised simplex method presented in the first part of 
Section 3.7. The basis is never inverted; instead, its triangular form is used directly 
to solve for all required variables. 


Simplex Multipliers 


Simplex multipliers are associated with the constraint equations. In this case we 
partition the vector of multipliers as $\lambda = (u, v)$. Here, $u_i$ represents the multiplier 
associated with the $i$th row sum constraint, and $v_j$ represents the multiplier associated 
with the $j$th column sum constraint. Since one of the constraints is redundant, an 
arbitrary value may be assigned to any one of the multipliers (see Exercise 4, 
Chapter 4). For notational simplicity we shall at this point set $v_n = 0$. 

Given a basis $B$, the simplex multipliers are found to be the solution to the 
equation $\lambda^T B = c_B^T$. To determine the explicit form of these equations, we again 
refer to the original system of constraints (3). If $x_{ij}$ is basic, then the corresponding 
column from $A$ will be included in $B$. This column has exactly two $+1$ entries: 
one in the $i$th position of the top portion and one in the $j$th position of the bottom 
portion. This column thus generates the simplex multiplier equation $u_i + v_j = c_{ij}$, 
since $u_i$ and $v_j$ are the corresponding components of the multiplier vector. Overall, 
the simplex multiplier equations are 


$$u_i + v_j = c_{ij} \tag{9}$$ 

for all $i, j$ for which $x_{ij}$ is basic. The coefficient matrix of this system is the 
transpose of the basis matrix and hence it is triangular. Thus, this system can be 
solved by back substitution. This is similar to the procedure for finding the values 
of basic variables and, accordingly, as another corollary of the Basis Triangularity 
Theorem, an integer property holds for simplex multipliers. 


Corollary. If the unit costs $c_{ij}$ of a transportation problem are all integers, 
then (assuming one simplex multiplier is set arbitrarily equal to an integer) 
the simplex multipliers associated with any basis are integers. 


Once the simplex multipliers are known, the relative cost coefficients for 
nonbasic variables can be found in the usual manner as $r_D^T = c_D^T - \lambda^T D$. In this case 
the relative cost coefficients are 


$$r_{ij} = c_{ij} - u_i - v_j, \qquad i = 1, 2, \ldots, m; \quad j = 1, 2, \ldots, n. \tag{10}$$ 


This relation is valid for basic variables as well if we define relative cost coefficients 
for them, having value zero. 

Given a basis, computation of the simplex multipliers is quite similar to the 
calculation of the values of the basic variables. The calculation is easily carried out 
on an array of the form shown below, where the circled elements correspond to the 
positions of the basic variables in the current basis. 


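$$\begin{array}{ccccc|c}
c_{11} & c_{12} & c_{13} & \cdots & c_{1n} & u_1 \\
c_{21} & c_{22} & c_{23} & \cdots & c_{2n} & u_2 \\
\vdots & \vdots & \vdots &        & \vdots & \vdots \\
c_{m1} & c_{m2} & c_{m3} & \cdots & c_{mn} & u_m \\
\hline
v_1 & v_2 & v_3 & \cdots & v_n &
\end{array}$$ 


In this case the main part of the array, with the coefficients $c_{ij}$, remains fixed, and 
we calculate the extra column and row corresponding to $u$ and $v$. 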
The procedure for calculating the simplex multipliers is this: 


Step 1. Assign an arbitrary value to any one of the multipliers. 


Step 2. Scan the rows and columns of the array until a circled element $c_{ij}$ is found 
such that either $u_i$ or $v_j$ (but not both) has already been determined. 


Step 3. Compute the undetermined $u_i$ or $v_j$ from the equation $c_{ij} = u_i + v_j$. If all 
multipliers are determined, stop. Otherwise, return to Step 2. 


The triangularity of the basis guarantees that this procedure can be carried 
through to determine all the simplex multipliers. 
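The three steps above, followed by the evaluation of the relative cost coefficients (10), might be sketched as follows. The function names are invented for this illustration, cells are indexed from 0, and the arbitrary value of Step 1 is taken to be $v_n = 0$. The data are the cost array and Northwest Corner basis of the running example of this chapter.

```python
def simplex_multipliers(cost, basic_cells):
    """Solve u_i + v_j = c_ij over the basic cells by row and column scanning."""
    m, n = len(cost), len(cost[0])
    u, v = [None] * m, [None] * n
    v[n - 1] = 0                                   # Step 1: one arbitrary value
    pending = set(basic_cells)
    while pending:
        for (i, j) in sorted(pending):             # Step 2: exactly one side known
            if (u[i] is None) != (v[j] is None):
                if u[i] is None:                   # Step 3: compute the other side
                    u[i] = cost[i][j] - v[j]
                else:
                    v[j] = cost[i][j] - u[i]
                pending.remove((i, j))
                break
        else:
            raise ValueError("cells do not form a basis")
    return u, v

def relative_costs(cost, u, v):
    """r_ij = c_ij - u_i - v_j for every cell (zero on the basic cells)."""
    return [[cost[i][j] - u[i] - v[j] for j in range(len(v))]
            for i in range(len(u))]

cost = [[3, 4, 6, 8, 9],
        [2, 2, 4, 5, 5],
        [2, 2, 2, 3, 2],
        [3, 3, 2, 4, 2]]
basis = [(0, 0), (0, 1), (1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (3, 4)]
u, v = simplex_multipliers(cost, basis)
r = relative_costs(cost, u, v)
negative = [(i, j) for i in range(4) for j in range(5) if r[i][j] < 0]
print(u, v, negative)
```

With integer costs every multiplier is obtained by subtraction alone, which is the integer property stated in the corollary.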


Example 1. Consider the cost array of Example 1 of Section 6.1, which is shown 
below with the circled elements corresponding to a basic feasible solution (found 
by the Northwest Corner Rule). Only these numbers are used in the calculation of 
the multipliers. 


  ③   ④   6   8   9 
  2   ②   ④   ⑤   5 
  2   2   2   ③   2 
  3   3   2   ④   ② 


We first arbitrarily set $v_5 = 0$. We then scan the cells, searching for a circled 
element for which only one multiplier must be determined. This is the bottom right 
corner element, and it gives $u_4 = 2$. Then, from the equation $4 = 2 + v_4$, $v_4$ is found 
to be 2. Next, $u_3$ and $u_2$ are determined, then $v_3$ and $v_2$, and finally $u_1$ and $v_1$. The 
result is shown below: 


  ③   ④   6   8   9 |  5 
  2   ②   ④   ⑤   5 |  3 
  2   2   2   ③   2 |  1 
  3   3   2   ④   ② |  2 
 -2  -1   1   2   0 


Cycle of Change 


In accordance with the general simplex procedure, if a nonbasic variable has an 
associated relative cost coefficient that is negative, then that variable is a candidate 
for entry into the basis. As the value of this variable is gradually increased, the 
values of the current basic variables will change continuously in order to maintain 
feasibility. Then, as usual, the value of the new variable is increased precisely to 
the point where one of the old basic variables is driven to zero. 

We must work out the details of how the values of the current basic variables 
change as a new variable is entered. If the new basic vector is $d$, then the change 
in the other variables is given by $-B^{-1}d$, where $B$ is the current basis. Hence, once 
again we are faced with a problem of solving a system associated with the triangular 
basis, and once again the solution has special properties. In the next theorem recall 
that A is defined by (4). 


Theorem. Let $B$ be a basis from $A$ (ignoring one row), and let $d$ be another 
column. Then the components of the vector $y = B^{-1}d$ are either $0$, $+1$, or $-1$. 


Proof. Let $y$ be the solution to the equation $By = d$. Then $y$ is the representation 
of $d$ in terms of the basis. This equation can be solved by Cramer’s rule as 


$$y_k = \frac{\det B_k}{\det B},$$ 


where $B_k$ is the matrix obtained by replacing the $k$th column of $B$ by $d$. Both $B$ and 
$B_k$ are submatrices of the original constraint matrix $A$. The matrix $B$ may be put 
in triangular form with all diagonal elements equal to $+1$. Hence, accounting for 
the sign change that may result from the combined row and column interchanges, 
$\det B = +1$ or $-1$. Likewise, it can be shown (see Exercise 3) that $\det B_k = 0$, $+1$, 
or $-1$. We conclude that each component of $y$ is either $0$, $+1$, or $-1$. ∎ 


The implication of the above result is that when a new variable is added to the 
solution at a unit level, the current basic variables will each change by $+1$, $-1$, 
or 0. If the new variable has a value $\theta$, then, correspondingly, the basic variables 
change by $+\theta$, $-\theta$, or 0. It is therefore only necessary to determine the signs of 
change for each basic variable. 

The determination of these signs is again accomplished by row and column 
scanning. Operationally, one assigns a + to the cell of the entering variable to 
represent a change of $+\theta$, where $\theta$ is yet to be determined. Then +’s, −’s, and 0’s 
are assigned, one by one, to the cells of some basic variables, indicating changes 
of $+\theta$, $-\theta$, or 0 to maintain a solution. As usual, after each step there will always 
be an equation that uniquely determines the sign to be assigned to another basic 
variable. The result will be a sequence of pluses and minuses assigned to cells that 
form a cycle leading from the cell of the entering variable back to that cell. In 
essence, the new change is part of a cycle of redistribution of the commodity flow 
in the transportation system. 

Once the sequence of +’s, −’s, and 0’s is determined, the new basic feasible 
solution is found by setting the level of the change $\theta$. This is set so as to drive one 
of the old basic variables to zero. One must simply examine those basic variables for 
which a minus sign has been assigned, for these are the ones that will decrease as 
the new variable is introduced. Then $\theta$ is set equal to the smallest magnitude of these 
variables. This value is added to all cells that have a + assigned to them and subtracted 
from all cells that have a — assigned. The result will be the new basic feasible solution. 
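One possible way to carry out the sign assignment and the update in Python is sketched below. It is not the scanning order described in the text: instead, cells that stand alone in their row or column are first given a 0 and discarded, and the remaining cells, which form the cycle, receive alternating signs. The names `cycle_of_change` and `apply_change` are invented here, and the data are the Northwest Corner solution of the running example with cell (4, 3) of the text, which is (3, 2) in 0-based indexing, as the entering cell.

```python
def cycle_of_change(basic_cells, entering):
    """Return {cell: +1, -1 or 0} for the entering cell and every basic cell."""
    cells = set(basic_cells) | {entering}
    signs = {c: 0 for c in cells}
    pruned = True
    while pruned:
        pruned = False
        for (i, j) in list(cells):
            alone_in_row = sum(p == i for p, _ in cells) == 1
            alone_in_col = sum(q == j for _, q in cells) == 1
            if alone_in_row or alone_in_col:
                cells.remove((i, j))       # such a cell cannot change: sign 0
                pruned = True
    # The cells that remain form the cycle; walk it, alternating row and column moves.
    cell, sign, along_row = entering, +1, True
    while signs[cell] == 0:
        signs[cell] = sign
        k = 0 if along_row else 1
        cell = next(c for c in cells if c[k] == cell[k] and c != cell)
        sign, along_row = -sign, not along_row
    return signs

def apply_change(x, signs):
    """Set theta to the smallest value with a minus sign and update the solution."""
    theta, leaving = min((x[c], c) for c, s in signs.items() if s < 0)
    new_x = {c: x.get(c, 0) + s * theta for c, s in signs.items() if c != leaving}
    return new_x, theta, leaving

x = {(0, 0): 10, (0, 1): 20, (1, 1): 30, (1, 2): 20,
     (1, 3): 30, (2, 3): 10, (3, 3): 40, (3, 4): 20}
signs = cycle_of_change(x, (3, 2))
new_x, theta, leaving = apply_change(x, signs)
print(theta, leaving)
```

Ties for the smallest value are broken here by cell index, so that exactly one variable leaves the basis; that choice is an assumption of the sketch.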

The procedure is illustrated by the following example. 


Example 3. A completed solution array is shown below: 


      |      | 10⁰  |      |      | 10 
      |      | 20⁻  |      | 10⁺  | 30 
 20⁺  | 10⁰  |      |      | 30⁻  | 60 
 10⁰  |      |      |      |      | 10 
 10⁻  |      |  +   | 40⁰  |      | 50 
 40   | 10   | 30   | 40   | 40   | 


In this example $x_{53}$ is the entering variable, so a plus sign is assigned there. The signs 
of the other cells were determined in the order $x_{13}, x_{23}, x_{25}, x_{35}, x_{32}, x_{31}, x_{41}, x_{51}, x_{54}$. 
The smallest variable with a minus assigned to it is $x_{51} = 10$. Thus we set $\theta = 10$. 


The Transportation Algorithm 


It is now possible to put together the components developed to this point in the form of 
a complete revised simplex procedure for the transportation problem. The steps are: 


Step 1. Compute an initial basic feasible solution using the Northwest Corner Rule 
or some other method. 


Step 2. Compute the simplex multipliers and the relative cost coefficients. If all 
relative cost coefficients are nonnegative, stop; the solution is optimal. Otherwise, 
go to Step 3. 


Step 3. Select a nonbasic variable corresponding to a negative cost coefficient to 
enter the basis (usually the one corresponding to the most negative cost coefficient). 
Compute the cycle of change and set $\theta$ equal to the smallest basic variable with a 
minus assigned to it. Update the solution. Go to Step 2. 
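The three steps can be assembled into a compact Python sketch. It is a hedged illustration rather than a production routine: all function names are invented here, cells are indexed from 0, the problem is assumed to be balanced, the last column multiplier is set to zero, and no safeguard against cycling under degeneracy is included. The data are those of the running example of this chapter.

```python
def northwest_corner(a, b):
    """Initial basic feasible solution as {(i, j): value} with m+n-1 cells."""
    a, b = list(a), list(b)
    i = j = 0
    x = {}
    while True:
        q = min(a[i], b[j])
        x[(i, j)] = q
        a[i] -= q
        b[j] -= q
        if i == len(a) - 1 and j == len(b) - 1:
            return x
        if a[i] == 0 and i < len(a) - 1:
            i += 1                       # row requirement met: move down
        else:
            j += 1                       # otherwise move right

def multipliers(cost, basis):
    """Solve u_i + v_j = c_ij over the basic cells, with the last v set to 0."""
    u, v = [None] * len(cost), [None] * len(cost[0])
    v[-1] = 0
    for _ in basis:                      # each pass fixes at least one multiplier
        for (i, j) in basis:
            if u[i] is None and v[j] is not None:
                u[i] = cost[i][j] - v[j]
            elif v[j] is None and u[i] is not None:
                v[j] = cost[i][j] - u[i]
    return u, v

def cycle_signs(basis, enter):
    """Signs (+1 or -1) on the cells of the cycle created by the entering cell."""
    cells = set(basis) | {enter}
    pruned = True
    while pruned:                        # drop cells alone in a row or column
        pruned = False
        for (i, j) in list(cells):
            if (sum(p == i for p, _ in cells) == 1 or
                    sum(q == j for _, q in cells) == 1):
                cells.remove((i, j))
                pruned = True
    signs, cell, sign, by_row = {}, enter, 1, True
    while cell not in signs:
        signs[cell] = sign
        k = 0 if by_row else 1           # alternate row moves and column moves
        cell = next(c for c in cells if c[k] == cell[k] and c != cell)
        sign, by_row = -sign, not by_row
    return signs

def solve_transportation(cost, a, b):
    x = northwest_corner(a, b)                               # Step 1
    while True:
        u, v = multipliers(cost, x)                          # Step 2
        r, enter = min((cost[i][j] - u[i] - v[j], (i, j))
                       for i in range(len(a)) for j in range(len(b)))
        if r >= 0:
            return x                                         # optimal
        signs = cycle_signs(x, enter)                        # Step 3
        theta, leave = min((x[c], c) for c, s in signs.items() if s < 0)
        for c, s in signs.items():
            x[c] = x.get(c, 0) + s * theta
        del x[leave]

cost = [[3, 4, 6, 8, 9],
        [2, 2, 4, 5, 5],
        [2, 2, 2, 3, 2],
        [3, 3, 2, 4, 2]]
supply, demand = [30, 80, 10, 60], [10, 50, 20, 80, 20]
x = solve_transportation(cost, supply, demand)
total = sum(cost[i][j] * q for (i, j), q in x.items())
print(sorted(x.items()), total)
```

The entering cell is the one with the most negative relative cost coefficient, as suggested in Step 3; any other negative cell would serve equally well.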


Example 4. We can now completely solve the problem that was introduced in 
Example 1 of the first section. The requirements and a first basic feasible solution 
obtained by the Northwest Corner Rule are shown below. The plus and minus 
signs indicated on the array should be ignored at this point, since they cannot be 
computed until the next step is completed. 


 10   | 20   |      |      |      | 30 
      | 30   | 20⁻  | 30⁺  |      | 80 
      |      |      | 10⁰  |      | 10 
      |      |  +   | 40⁻  | 20⁰  | 60 
 10   | 50   | 20   | 80   | 20   | 


The cost coefficients of the problem are shown in the array below, with the circled 
cells corresponding to the current basic variables. The simplex multipliers, computed 
by row and column scanning, are shown as well. 


  ③   ④   6   8   9 |  5 
  2   ②   ④   ⑤   5 |  3 
  2   2   2   ③   2 |  1 
  3   3   2   ④   ② |  2 
 -2  -1   1   2   0 


The relative cost coefficients are found by subtracting $u_i + v_j$ from $c_{ij}$. In this case 
the only negative result is in cell 4,3; so variable $x_{43}$ will be brought into the basis. 
Thus a + is entered into this cell in the original array, and the cycle of zeros and 
plus and minus signs is determined as shown in that array. (It is not necessary to 
continue scanning once a complete cycle is determined.) 

The smallest basic variable with a minus sign is 20 and, accordingly, 20 is 
added or subtracted from elements of the cycle as indicated by the signs. This leads 
to the new basic feasible solution shown in the array below: 


 10 | 20 |    |    |    | 30 
    | 30 |    | 50 |    | 80 
    |    |    | 10 |    | 10 
    |    | 20 | 20 | 20 | 60 
 10 | 50 | 20 | 80 | 20 | 


The new simplex multipliers corresponding to the new basis are computed, and 
the cost array is revised as shown below. In this case all relative cost coefficients 
are positive, indicating that the current solution is optimal. 


  ③   ④   6   8   9 |  5 
  2   ②   4   ⑤   5 |  3 
  2   2   2   ③   2 |  1 
  3   3   ②   ④   ② |  2 
 -2  -1   0   2   0 


Degeneracy 


As in all linear programming problems, degeneracy, corresponding to a basic 
variable having the value zero, can occur in the transportation problem. If 
degeneracy is encountered in the simplex procedure, it can be handled quite easily by 
introduction of the standard perturbation method (see Exercise 15, Chapter 3). In 
this method a zero-valued basic variable is assigned the value $\varepsilon$ and is then treated 
in the usual way. If it later leaves the basis, then the $\varepsilon$ can be dropped. 


Example 5. To illustrate the method of dealing with degeneracy, consider a 
modification of Example 4, with the fourth row sum changed from 60 to 20 and the 
fourth column sum changed from 80 to 40. Then the initial basic feasible solution 
found by the Northwest Corner Rule is degenerate. An $\varepsilon$ is placed in the array for 
the zero-valued basic variable as shown below: 


 10   | 20   |      |      |      | 30 
      | 30   | 20⁻  | 30⁺  |      | 80 
      |      |      | 10⁰  |      | 10 
      |      |  +   | ε⁻   | 20⁰  | 20 
 10   | 50   | 20   | 40   | 20   | 


The relative cost coefficients will be the same as in Example 4, and hence again 
$x_{43}$ should be chosen to enter, and the cycle of change is the same as before. In 
this case, however, the change is only $\varepsilon$, and variable $x_{44}$ leaves the basis. The new 
relative cost coefficients are all nonnegative, indicating that the new solution is optimal. 
Now the $\varepsilon$ can be dropped to yield the final solution (which is, itself, degenerate 
in this case). 


 10 | 20 |    |    |    | 30 
    | 30 | 20 | 30 |    | 80 
    |    |    | 10 |    | 10 
    |    |  ε |    | 20 | 20 
 10 | 50 | 20 | 40 | 20 | 


6.5 THE ASSIGNMENT PROBLEM 


The assignment problem is a very special case of the transportation problem for two 
reasons. First, the areas of application in which it arises are usually quite distinct 
from those of the more general transportation problem; and second, its unique 
structure is of theoretical significance. 

The classic example of the assignment problem is that of optimally assigning 
n workers to n jobs. If worker $i$ is assigned to job $j$, there is a benefit of $c_{ij}$. Each 
worker must be assigned to exactly one job, and each job must have one assigned 
worker. One wishes to make the assignment in such a way as to maximize (in this 
example) the total value of the assignment. 


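The general formulation of the assignment problem is to find $x_{ij}$, $i = 1, 2, \ldots, n$; 
$j = 1, 2, \ldots, n$ to 

$$\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{n} \sum_{i=1}^{n} c_{ij} x_{ij} \\
\text{subject to} \quad & \sum_{j=1}^{n} x_{ij} = 1 \quad \text{for } i = 1, 2, \ldots, n \\
& \sum_{i=1}^{n} x_{ij} = 1 \quad \text{for } j = 1, 2, \ldots, n \\
& x_{ij} \geq 0 \quad \text{for } i = 1, 2, \ldots, n; \ j = 1, 2, \ldots, n.
\end{aligned} \tag{11}$$ 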


In the motivating examples, it is actually required that each of the variables $x_{ij}$ take 
the values 0 or 1; otherwise the solution is not meaningful, since it is not possible 
to make fractional assignments. In the mathematical description, we relax the 
integer assumption and instead formulate the problem as a true linear programming 
problem. As stated in the theorem below, this actually leads to the desired result. 


Theorem. Any basic feasible solution of the assignment problem has every 
$x_{ij}$ equal to either zero or one. 


Proof. According to the corollary of the Basis Triangularity Theorem, all basic 
variables in any basic solution are integers. Clearly, no variable can exceed 1 
because the right-hand sides of the constraint equations are all 1. Therefore, all 
variables must be either zero or one. ∎ 


It follows that there are at most $n$ basic variables that have the value 1 
because there can be at most a single one in each row (and in each column). In a 
general transportation problem of this dimension, however, a nondegenerate basic 
solution would have $2n-1$ positive variables. Thus, basic feasible solutions to 
the assignment problem are highly degenerate, with $n-1$ basic variables equal 
to zero. 

The assignment problem can be solved, of course, by use of the general 
transportation algorithm described in Section 6.4. It is a bit tedious to do so, however, 
because of the highly degenerate nature of basic feasible solutions. A highly efficient 
special algorithm was developed for the assignment problem, based on the work of 
two Hungarian mathematicians, and this method was later generalized to form the 
primal-dual method for linear programming. 
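Because every basic feasible solution of (11) is a zero-one array with a single one in each row and each column, an optimal solution may be sought among the $n!$ permutations. The sketch below does exactly that by enumeration. It is neither the special algorithm mentioned above nor the transportation algorithm, and it is practical only for very small $n$. The function name `best_assignment` and the 3 x 3 cost array are hypothetical, chosen only for illustration.

```python
from itertools import permutations

def best_assignment(c):
    """Minimize the total cost over all one-to-one assignments (brute force)."""
    n = len(c)
    best = min(permutations(range(n)),
               key=lambda p: sum(c[i][p[i]] for i in range(n)))
    # x[i][j] = 1 exactly when worker i is assigned to job j
    x = [[1 if best[i] == j else 0 for j in range(n)] for i in range(n)]
    return x, sum(c[i][best[i]] for i in range(n))

c = [[4, 1, 3],          # hypothetical costs, not taken from the text
     [2, 0, 5],
     [3, 2, 2]]
x, value = best_assignment(c)
for row in x:
    print(row)
print(value)
```

The array `x` contains $n$ ones, whereas a basis has $2n-1$ variables, which is the degeneracy noted above. A maximization problem can be handled by negating the entries of `c`.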


6.6 BASIC NETWORK CONCEPTS 


We now begin a study of an entirely different topic in linear programming: graphs 
and flows in networks. It will be seen, however, that this topic provides a foundation 
for a wide assortment of linear programming applications and, in fact, provides 
a different approach to the problems considered in the first part of this chapter. 
This section covers some of the basic graph and network terminology and concepts 
necessary for the development of this alternative approach. 


Definition. A graph consists of a finite collection of elements called nodes 
together with a subset of unordered pairs of the nodes called arcs. 


The nodes of a graph are usually numbered, say, 1, 2, 3, ..., n. An arc between 
nodes i and j is then represented by the unordered pair (i, j). A graph is typically 
represented as shown in Fig. 6.1. The nodes are designated by circles, with the 
number inside each circle denoting the index of that node. The arcs are represented 
by the lines between the nodes. 


Fig. 6.1 A graph 

There are a number of other elementary definitions associated with graphs 
that are useful in describing their structure. A chain between nodes $i$ and $j$ 
is a sequence of arcs connecting them. The sequence must have the form 
$(i, k_1), (k_1, k_2), (k_2, k_3), \ldots, (k_m, j)$. In Fig. 6.1, (1, 2), (2, 4), (4, 3) is a chain 
between nodes 1 and 3. If a direction of movement along a chain is specified (say, 
from node $i$ to node $j$), it is then called a path from $i$ to $j$. A cycle is a chain 
leading from node i back to node i. The chain (1, 2), (2, 4), (4, 3), (3, 1) is a cycle 
for the graph in Fig. 6.1. 

A graph is connected if there is a chain between any two nodes. Thus, the 
graph of Fig. 6.1 is connected. A graph is a tree if it is connected and has no cycles. 
Removal of any one of the arcs (1, 2), (1, 3), (2, 4), (3, 4) would transform the 
graph of Fig. 6.1 into a tree. Sometimes we consider a tree within a graph G, which 
is just a tree made up of a subset of arcs from G. Such a tree is a spanning tree if 
it touches all nodes of G. It is easy to see that a graph is connected if and only if 
it contains a spanning tree. 
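These definitions translate directly into a small Python sketch. The function names are invented here, and since Fig. 6.1 is not reproduced, the graph used is a hypothetical one consisting only of the four arcs of the cycle named in the text. The test for a tree relies on the standard fact that a connected graph on $n$ nodes has no cycles exactly when it has $n-1$ arcs.

```python
from collections import deque

def is_connected(nodes, arcs):
    """True if there is a chain between every pair of nodes."""
    nodes = list(nodes)
    neighbors = {v: set() for v in nodes}
    for i, j in arcs:                     # arcs are unordered pairs
        neighbors[i].add(j)
        neighbors[j].add(i)
    seen, queue = {nodes[0]}, deque([nodes[0]])
    while queue:                          # breadth-first search from the first node
        for w in neighbors[queue.popleft()]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(nodes)

def is_tree(nodes, arcs):
    """Connected and without cycles, i.e. connected with n - 1 arcs."""
    return is_connected(nodes, arcs) and len(arcs) == len(list(nodes)) - 1

nodes = [1, 2, 3, 4]
arcs = [(1, 2), (1, 3), (2, 4), (3, 4)]   # hypothetical graph: a single cycle
print(is_connected(nodes, arcs), is_tree(nodes, arcs))
for removed in arcs:
    rest = [a for a in arcs if a != removed]
    print(removed, is_tree(nodes, rest))  # removing any one arc leaves a tree
```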

Our interest will focus primarily on directed graphs, in which a sense of 
orientation is given to each arc. In this case an arc is considered to be an ordered 
pair of nodes (i,j), and we say that the arc is from node i to node j. This is 
indicated on the graph by having an arrow on the arc pointing from i to j as shown 
in Fig. 6.2. When working with directed graphs, some node pairs may have an arc 
in both directions between them. Rather than explicitly indicating both arcs in such 
a case, it is customary to indicate a single undirected arc. The notions of paths and 
cycles can be directly applied to directed graphs. In addition we say that node j is 
reachable from $i$ if there is a path from node $i$ to $j$. 

In addition to the visual representation of a directed graph characterized by 
Fig. 6.2, another common method of representation is in terms of a graph’s node-arc 
incidence matrix. This is constructed by listing the nodes vertically and the arcs 
horizontally. Then in the column under arc (i, j), a +1 is placed in the position 
corresponding to node i and a —1 is placed in the position corresponding to node 
j. The incidence matrix for the graph of Fig. 6.2 is shown in Table 6.1. 


Fig. 6.2 A directed graph 


        (1,2)  (1,4)  (2,3)  (2,4)  (4,2) 
   1      1      1 
   2     -1             1      1     -1 
   3                   -1 
   4            -1            -1      1 


Table 6.1 Incidence matrix for example 


Clearly, all information about the structure of the graph is contained in the 
node-arc incidence matrix. This representation is often very useful for computational 
purposes, since it is easily stored in a computer. 
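As a brief illustration, the construction of the node-arc incidence matrix can be sketched in Python. The function name `incidence_matrix` is an assumption of this sketch; the arc list is the one that heads the columns of Table 6.1, with nodes numbered from 1.

```python
def incidence_matrix(n_nodes, arcs):
    """Rows are nodes, columns are arcs: +1 at the tail i, -1 at the head j."""
    A = [[0] * len(arcs) for _ in range(n_nodes)]
    for col, (i, j) in enumerate(arcs):
        A[i - 1][col] = 1
        A[j - 1][col] = -1
    return A

arcs = [(1, 2), (1, 4), (2, 3), (2, 4), (4, 2)]
A = incidence_matrix(4, arcs)
for row in A:
    print(row)
```

Each column contains exactly one $+1$ and one $-1$, so the rows of the matrix sum to the zero vector.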


Flows in Networks 


A graph is an effective way to represent the communication structure between 
nodes. When there is the possibility of flow along the arcs, we refer to the directed 
graph as a network. In applications the network might represent a transportation 
system or a communication network, or it may simply be a representation used for 
mathematical purposes (such as in the assignment problem). 

A flow in a given directed arc $(i, j)$ is a number $x_{ij} \geq 0$. Flows in the arcs of 
the network must jointly satisfy a conservation criterion at each node. Specifically, 
unless the node is a source or sink as discussed below, flow cannot be created or 
lost at a node; the total flow into a node must equal the total flow out of the node. 
Thus at each such node $i$ 


$$\sum_{j=1}^{n} x_{ij} - \sum_{k=1}^{n} x_{ki} = 0.$$ 


The first sum is the total flow from $i$, and the second sum is the total flow to $i$. 
(Of course $x_{ij}$ does not exist if there is no arc from $i$ to $j$.) It should be clear that 
for nonzero flows to exist in a network without sources or sinks, the network must 
contain a cycle. 

In many applications, some nodes are in fact designated as sources or sinks (or, 
alternatively, supply nodes or demand nodes). The net flow out of a source may be 
positive, and the level of this net flow may either be fixed or variable, depending 
on the application. Similarly, the net flow into a sink may be positive. 


6.7 MINIMUM COST FLOW 


In this section we consider the basic minimum cost flow problem, which slightly 
generalizes the transportation problem. The primary objective of this section is to 
develop a network interpretation of the concepts for the transportation problem 
previously developed principally in algebraic terms. 

Consider a network having $n$ nodes. Corresponding to each node $i$, there is a 
number $b_i$ representing the available supply at the node. (If $b_i < 0$, then there is a 
required demand.) We assume that the network is balanced in the sense that 


$$\sum_{i=1}^{n} b_i = 0.$$ 


Associated with each arc $(i, j)$ is a number $c_{ij}$, representing the unit cost for 
flow along this arc. The minimal cost flow problem is that of determining flows 
$x_{ij} \geq 0$ in each arc of the network so that the net flow out of each node $i$ is $b_i$ while 
minimizing the total cost. In mathematical terms the problem is 

$$\begin{aligned}
\text{minimize} \quad & \sum c_{ij} x_{ij} \\
\text{subject to} \quad & \sum_{j=1}^{n} x_{ij} - \sum_{k=1}^{n} x_{ki} = b_i, \quad i = 1, 2, \ldots, n \\
& x_{ij} \geq 0, \quad i, j = 1, 2, \ldots, n.
\end{aligned} \tag{12}$$ 

The transportation problem is a special case of this problem, corresponding 
to a network with arcs going only from supply to demand nodes, which reflects 
the fact that shipping is restricted in that problem to be directly from a supply 
node to a demand node. The more general problem allows for arbitrary network 
configurations, so that flow from a supply node may progress through several 
intermediate nodes before reaching its destination. The more general problem is 
often termed the transshipment problem. 


Problem Structure 


Problem (12) is clearly a linear program. The coefficient matrix $A$ of the flow 
constraints is the node-arc incidence matrix of the network. The column 
corresponding to arc $(i, j)$ has a $+1$ entry in row $i$ and a $-1$ entry in row $j$. It follows 
that, since the sum of all rows is the zero vector, the matrix $A$ has rank of at most 
$n-1$, and any row of $A$ can be dropped to obtain a coefficient matrix of rank equal 
to that of the original. We shall show, using network concepts, that the rank of the 
coefficient matrix is indeed $n-1$ under a simple connectivity assumption on the 
network. 

To state the required assumption precisely, we define the undirected graph G of 
the network. Each arc of the network is included in G, independent of its direction. 
(The orientation of arcs is not considered here because we are only interested in 
linear properties of A.) We must assume that G is connected. This implies that G 
contains at least one spanning tree. 

Now, to proceed with the proof that the rank of $A$ is $n-1$, select any arbitrary 
row to drop from $A$, and denote the corresponding new matrix by $\bar{A}$. Consider any 
spanning tree $T$ in the graph $G$. This tree will consist of $n-1$ arcs without a cycle. 
We refer to the node corresponding to the row that was dropped from $A$ as the root 
of the tree. Let $A_T$ be the $(n-1) \times (n-1)$ submatrix of $\bar{A}$, consisting of the $n-1$ 
columns corresponding to arcs in the tree. At least two nodes of the tree must have 
only a single arc of $T$ touching them, and at least one of these is not the root. 
This means that the corresponding row of $A_T$ has a single nonzero entry. Imagine 
that we cross out that row and the column corresponding to that entry. In terms 
of the tree, this corresponds to elimination of that node and the arc that touched 
it. The $n-2$ remaining arcs in $T$ form a tree for the reduced network of $n-1$ 
nodes, including the root. The procedure can therefore be repeated consecutively, 
eliminating all nodes except the root until all rows of $A_T$ are dropped. 

It is clear that the above process is equivalent to the triangularization procedure 
of Section 6.3. In other words $A_T$ is an $(n-1) \times (n-1)$ nonsingular triangular 
submatrix. It follows that $\bar{A}$ has rank equal to $n-1$. 


Structure of a Basis 


We have shown above that a spanning tree of G corresponds to a basis, since it 
defines a nonsingular submatrix A. We will now show the converse. 

A basis corresponds to a choice of n − 1 linearly independent columns from A.
Each column corresponds to an arc from the network, so a selection of a basis is
equivalent to a selection of n − 1 arcs. We want to show that these arcs must form a
spanning tree. Suppose that the collection of arcs corresponding to the basis contains
a cycle consisting of, say, m arcs. When arranged as a cycle, the arcs are of the form
$(n_1, n_2), (n_2, n_3), (n_3, n_4), \ldots, (n_m, n_1)$. In this ordering, some arcs may preserve their
original orientation and some may be reversed. Now consider the corresponding
columns $a_1, a_2, \ldots, a_m$ of A. Form the linear combination $\pm a_1 \pm a_2 \pm a_3 \cdots \pm a_m$,
where in each case the coefficient is + if the orientation of the arc is the same
in the cycle as in the original graph and is − if not. The ith column vector in
this combination (after accounting for the sign coefficient) corresponds to the arc
$(n_i, n_{i+1})$ of the cycle and has a +1 in the row corresponding to $n_i$ and a −1
in the row corresponding to $n_{i+1}$. As a result, the +1's and −1's all cancel in
the combination. Thus, the combination is the zero vector, contradicting the linear
independence of $a_1, a_2, \ldots, a_m$. We have therefore established that the collection of
arcs corresponding to a basis does not contain a cycle. Since there are n − 1 arcs
and n nodes, it is easy to conclude (but we leave it to the reader) that the arcs must
form a spanning tree.
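The cancellation argument above can be checked numerically on a small case. The following is only an illustrative sketch: the four-node network, the node numbering, and the helper names `incidence_column` and `signed_cycle_sum` are assumptions made for this example and do not come from the text.

```python
def incidence_column(arc, n):
    """Column of the node-arc incidence matrix for arc (i, j):
    +1 in the row of node i, -1 in the row of node j (nodes numbered 1..n)."""
    i, j = arc
    col = [0] * n
    col[i - 1] += 1
    col[j - 1] -= 1
    return col


def signed_cycle_sum(cycle_nodes, arcs, n):
    """Add the columns of the arcs around a cycle, using coefficient +1 when
    the arc points along the cycle and -1 when it points against it."""
    total = [0] * n
    m = len(cycle_nodes)
    for t in range(m):
        u, v = cycle_nodes[t], cycle_nodes[(t + 1) % m]
        if (u, v) in arcs:
            sign, arc = 1, (u, v)
        elif (v, u) in arcs:
            sign, arc = -1, (v, u)
        else:
            raise ValueError(f"no arc joins nodes {u} and {v}")
        col = incidence_column(arc, n)
        total = [a + sign * c for a, c in zip(total, col)]
    return total


# Hypothetical network: arcs (3, 2) and (1, 4) are reversed relative to
# the cycle 1 -> 2 -> 3 -> 4 -> 1.
arcs = {(1, 2), (3, 2), (3, 4), (1, 4)}
result = signed_cycle_sum([1, 2, 3, 4], arcs, n=4)
print(result)
```

The signed sum is the zero vector, so the four columns are linearly dependent and cannot all belong to a basis.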

We conclude from the above explanation and the earlier discussion that there 
is a direct one-to-one correspondence between the arcs (columns) in a basis and 
spanning trees. We also know that any basis is triangular; and it is also easy to 
see from the triangularity that a basis is unimodular (see Exercise 3). Therefore, 
the essential characteristics of the transportation problem carry over to the more 
general minimum cost flow problem. 

Given a basis, the corresponding basic solution can be found by back substitution
using the triangular structure. In this process one looks for an equation having
just a single undetermined basic variable corresponding to a single undetermined
arc flow $x_{ij}$. This equation is solved for $x_{ij}$, and then another such equation is
found. In terms of network concepts, one looks first for an end of the spanning
tree corresponding to the basis; that is, one finds a node that touches only one arc 
of the tree. The flow in this arc is then determined by the supply (or demand) at 
that node. Back substitution corresponds to solving for flows along the arcs of the 
spanning tree, starting from an end and successively eliminating arcs. 
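As a rough sketch of this back substitution, the code below peels off one end of the spanning tree at a time. It assumes the conservation convention that outflow minus inflow at node i equals its net supply b_i (positive for supply, negative for demand); the function name `tree_flows` and the example data are hypothetical.

```python
def tree_flows(tree_arcs, supply):
    """Solve for the flows on the arcs of a spanning tree by repeatedly
    finding a node that touches only one remaining tree arc.
    supply[i] is the net supply of node i and must sum to zero."""
    remaining = set(tree_arcs)
    b = dict(supply)          # working copy of the right-hand sides
    flow = {}
    while remaining:
        # Count the remaining tree arcs touching each node.
        degree = {}
        for i, j in remaining:
            degree[i] = degree.get(i, 0) + 1
            degree[j] = degree.get(j, 0) + 1
        leaf = next(node for node, d in degree.items() if d == 1)
        i, j = next(arc for arc in remaining if leaf in arc)
        if leaf == i:
            x = b[i]          # everything supplied at i leaves on (i, j)
            b[j] += x         # node j now has that much more to pass on
        else:
            x = -b[j]         # everything demanded at j arrives on (i, j)
            b[i] -= x         # node i has that much less left to send
        flow[(i, j)] = x
        remaining.remove((i, j))
    return flow


# Hypothetical data: node 1 supplies 5 units, nodes 3 and 4 demand 2 and 3.
flows = tree_flows([(1, 2), (2, 3), (2, 4)], {1: 5, 2: 0, 3: -2, 4: -3})
print(flows)
```

A negative value in the result would indicate that the basic solution is not feasible; the sketch does not check for this.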


The Simplex Method 


The revised simplex method can easily be applied to the generalized minimum cost 
flow problem. We describe the steps below together with a brief discussion of their 
network interpretation. 


Step 1. Start with a given basic feasible solution.


Step 2. Compute simplex multipliers $\lambda_i$ for each node i. This amounts to solving
the equations


$$\lambda_i - \lambda_j = c_{ij} \tag{13}$$


for each i, j corresponding to a basic arc. This follows because arc (i, j) corresponds
to a column in A with a +1 at row i and a −1 at row j. The equations are solved
by arbitrarily setting the value of any one multiplier. An equation with only one
undetermined multiplier is then found and that value determined, and so forth.

The relative cost coefficients for nonbasic arcs are then


$$r_{ij} = c_{ij} - (\lambda_i - \lambda_j). \tag{14}$$

If all relative cost coefficients are nonnegative, stop; the solution is optimal.
Otherwise, go to Step 3.
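Step 2 can be sketched as follows. This is a minimal illustration, not a full implementation: the function names, the choice of node 1 as the node whose multiplier is set arbitrarily (to zero), and the cost data are all assumptions made for the example.

```python
def simplex_multipliers(tree_arcs, cost, root, root_value=0):
    """Solve lambda_i - lambda_j = c_ij over the basic (tree) arcs,
    after fixing the multiplier of one node arbitrarily."""
    lam = {root: root_value}
    pending = list(tree_arcs)
    while pending:
        progressed = False
        for arc in list(pending):
            i, j = arc
            if i in lam and j not in lam:
                lam[j] = lam[i] - cost[arc]   # one unknown: lambda_j
            elif j in lam and i not in lam:
                lam[i] = lam[j] + cost[arc]   # one unknown: lambda_i
            else:
                continue                       # not yet solvable
            pending.remove(arc)
            progressed = True
        if not progressed:
            raise ValueError("arcs do not form a spanning tree")
    return lam


def relative_costs(nonbasic_arcs, cost, lam):
    """r_ij = c_ij - (lambda_i - lambda_j) for each nonbasic arc."""
    return {(i, j): cost[(i, j)] - (lam[i] - lam[j]) for (i, j) in nonbasic_arcs}


# Hypothetical costs: the first three arcs are basic, the last two nonbasic.
cost = {(1, 2): 2, (2, 3): 1, (2, 4): 3, (1, 3): 4, (3, 4): 1}
lam = simplex_multipliers([(1, 2), (2, 3), (2, 4)], cost, root=1)
r = relative_costs([(1, 3), (3, 4)], cost, lam)
print(lam, r)
```

In this example arc (3, 4) has a negative relative cost, so it would be a candidate to enter the basis in Step 3.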


Step 3. Select a nonbasic flow with negative relative cost coefficient to enter the 
basis. Addition of this arc to the spanning tree of the old basis will produce a cycle 
(see Fig. 6.3). Introduce a positive flow around this cycle of amount $\theta$. As $\theta$ is
increased, some old basic flows will decrease, so $\theta$ is chosen to be equal to the
smallest value that makes the net flow in one of the old basic arcs equal to zero. 
This variable goes out of the basis. The new spanning tree is therefore obtained by 
adding an arc to form a cycle and then eliminating one other arc from the cycle. 
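The cycle update of Step 3 might be sketched as below. The helper names `tree_path` and `pivot`, and the numerical data, are hypothetical; the sketch assumes the current basic flows are stored in a dictionary keyed by arc.

```python
def tree_path(tree_arcs, start, goal):
    """Return the unique tree path from start to goal as a list of
    (arc, direction) pairs; direction is +1 if the arc is traversed along
    its orientation and -1 if it is traversed against it."""
    def search(node, visited):
        if node == goal:
            return []
        for arc in tree_arcs:
            i, j = arc
            if i == node and j not in visited:
                nxt, direction = j, 1
            elif j == node and i not in visited:
                nxt, direction = i, -1
            else:
                continue
            rest = search(nxt, visited | {nxt})
            if rest is not None:
                return [(arc, direction)] + rest
        return None
    return search(start, {start})


def pivot(tree_flow, entering):
    """Send theta around the cycle formed by the entering arc and the tree."""
    i, j = entering
    # The cycle runs along the entering arc i -> j, then back from j to i.
    path = tree_path(list(tree_flow), j, i)
    decreasing = [arc for arc, direction in path if direction == -1]
    if not decreasing:
        raise ValueError("no basic flow decreases; flow can grow without bound")
    theta = min(tree_flow[arc] for arc in decreasing)
    leaving = next(arc for arc in decreasing if tree_flow[arc] == theta)
    new_flow = dict(tree_flow)
    for arc, direction in path:
        new_flow[arc] += direction * theta
    del new_flow[leaving]          # this variable goes out of the basis
    new_flow[entering] = theta     # the entering arc carries theta
    return new_flow, leaving, theta


# Hypothetical basic flows on a spanning tree; arc (3, 4) enters the basis.
old_flow = {(1, 2): 5, (2, 3): 2, (2, 4): 3}
new_flow, leaving, theta = pivot(old_flow, (3, 4))
print(new_flow, leaving, theta)
```

Here the flow on arc (2, 4) is the first to reach zero, so that arc leaves and the remaining arcs again form a spanning tree.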


Fig. 6.3 Spanning trees of basis


Additional Considerations


Additional features can be incorporated as in other applications of the simplex
method. For example, an initial basic feasible solution, if one exists, can be found
by the use of artificial variables in a phase I procedure. This can be accomplished
by introducing an additional node with zero supply and with an arc connected to
each other node, directed to nodes with demand and away from nodes with supply.
An initial basic feasible solution is then constructed with flow on these artificial
arcs. During phase I, the cost on the artificial arcs is unity and it is zero on all other
arcs. If the total cost can be reduced to zero, a basic feasible solution to the original
problem is obtained. (The reader might wish to show how the above technique can
be modified so that an additional node is not required.)

An important extension of the problem is the inclusion of upper bounds
(capacities) on allowable flow magnitudes in an arc, but we shall not describe the
details here.

Finally, it should be pointed out that there are various procedures for organizing 
the information required by the simplex method. The most straightforward procedure 
is to just work with the algebraic form defined by the node-arc incidence matrix.
Other procedures are based on representing the network structure more compactly 
and assigning flows to arcs and simplex multipliers to nodes. 


6.8 MAXIMAL FLOW 


A different type of network problem, discussed in this section, is that of deter- 
mining the maximal flow possible from one given source node to a sink node 
under arc capacity constraints. A preliminary problem, whose solution is a funda- 
mental building block of a method for solving the flow problem, is that of simply 
determining a path from one node to another in a directed graph. 


Tree Procedure 


Recall that node j is reachable from node i in a directed graph if there is a path 
from node i to node j. For simple graphs, determination of reachability can be 
accomplished by inspection, but for large graphs it generally cannot. The problem
can be solved systematically by a process of repeatedly labeling and scanning
various nodes in the graph. This procedure is the backbone of a number of methods 
for solving more complex graph and network problems, as illustrated later. It can 
also be used to establish quickly some important theoretical results. 

Assume that we wish to determine whether a path from node 1 to node m
exists. At each step of the algorithm, each node is either unlabeled, labeled but 
unscanned, or labeled and scanned. The procedure consists of these steps: 


Step 1. Label node 1 with any mark. All other nodes are unlabeled.


Step 2. For any labeled but unscanned node i, scan the node by finding all 
unlabeled nodes reachable from i by a single arc. Label these nodes with an i. 


Step 3. If node m is labeled, stop; a breakthrough has been achieved—a path exists. 
If no unlabeled nodes can be labeled, stop; no connecting path exists. Otherwise, 
go to Step 2. 
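The three steps above can be sketched in a few lines. This is an illustrative version only: it scans nodes in first-labeled, first-scanned order, which is one admissible choice among many, and the function name and example graph are assumptions.

```python
from collections import deque


def tree_procedure(arcs, source, target):
    """Label and scan nodes starting from source. Returns the path from
    source to target as a list of nodes, or None if target is unreachable."""
    label = {source: None}            # Step 1: the source gets an arbitrary mark
    unscanned = deque([source])
    while unscanned and target not in label:
        i = unscanned.popleft()       # Step 2: scan a labeled, unscanned node
        for u, v in arcs:
            if u == i and v not in label:
                label[v] = i          # label v with the node it was reached from
                unscanned.append(v)
    if target not in label:           # Step 3: nothing more can be labeled
        return None
    path = [target]                   # trace the labels backward from target
    while label[path[-1]] is not None:
        path.append(label[path[-1]])
    path.reverse()
    return path


# Hypothetical directed graph.
arcs = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (5, 3)]
forward = tree_procedure(arcs, 1, 5)
backward = tree_procedure(arcs, 5, 1)
print(forward, backward)
```

Node 5 is reachable from node 1, but node 1 is not reachable from node 5, so the second call returns `None`.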


The process is illustrated in Fig. 6.4, where a path between nodes 1 and 10 is
sought. The nodes have been labeled and scanned in the order 1, 2, 3, 5, 6, 8, 4, 7, 
9, 10. The labels are indicated close to the nodes. The arcs that were used in the 
scanning processes are indicated by heavy lines. Note that the collection of nodes 
and arcs selected by the process, regarded as an undirected graph, form a tree—a 
graph without cycles. This, of course, accounts for the name of the process, the 
tree procedure. If one is interested only in determining whether a connecting path 
exists and does not need to find the path itself, then the labels need only be simple 
check marks rather than node indices. However, if node indices are used as labels, 
then after successful completion of the algorithm, the actual connecting path can be 
found by tracing backward from node m by following the labels. In the example, 
one begins at 10 and moves to node 7 as indicated; then to 6, 3, and 1. The path 
follows the reverse of this sequence. 

Fig. 6.4 The scanning procedure

It is easy to prove that the algorithm does indeed resolve the issue of the
existence of a connecting path. At each stage of the process, either a new node
is labeled, it is impossible to continue, or node m is labeled and the process is
successfully terminated. Clearly, the process can continue for at most n − 1 stages,
where n is the number of nodes in the graph. Suppose at some stage it is impossible
to continue. Let S be the set of labeled nodes at that stage and let $\bar{S}$ be the set of
unlabeled nodes. Clearly, node 1 is contained in S, and node m is contained in $\bar{S}$. If
there were a path connecting node 1 with node m, then there must be an arc in that
path from a node k in S to a node in $\bar{S}$. However, this would imply that node k was
not scanned, which is a contradiction. Conversely, if the algorithm does continue
until reaching node m, then it is clear that a connecting path can be constructed
backward as outlined above.


Capacitated Networks 


In some network applications it is useful to assume that there are upper bounds 
on the allowable flow in various arcs. This motivates the concept of a capacitated 
network. 


Definition. A capacitated network is a network in which some arcs are
assigned nonnegative capacities, which define the maximum allowable flow
in those arcs. The capacity of an arc (i, j) is denoted $k_{ij}$, and this capacity is
indicated on the graph by placing the number $k_{ij}$ adjacent to the arc.


Throughout this section all capacities are assumed to be nonnegative integers.
Figure 6.5 shows an example of a network with the capacities indicated. Thus the
capacity from node 1 to node 2 is 12, while that from node 2 to node 1 is 6.


Fig. 6.5 A network with capacities


The Maximal Flow Problem


Consider a capacitated network in which two special nodes, called the source and
the sink, are distinguished. Say they are nodes 1 and m, respectively. All other
nodes must satisfy the strict conservation requirement; that is, the net flow into
these nodes must be zero. However, the source may have a net outflow and the
sink a net inflow. The outflow f of the source will equal the inflow of the sink as
a consequence of the conservation at all other nodes. A set of arc flows satisfying
these conditions is said to be a flow in the network of value f. The maximal flow
problem is that of determining the maximal flow that can be established in such a
network. When written out, it takes the form


$$
\begin{aligned}
\text{maximize}\quad & f \\
\text{subject to}\quad & \sum_{j=1}^{n} x_{1j} - \sum_{j=1}^{n} x_{j1} - f = 0 \\
& \sum_{j=1}^{n} x_{ij} - \sum_{j=1}^{n} x_{ji} = 0, \quad i \neq 1, m \\
& \sum_{j=1}^{n} x_{mj} - \sum_{j=1}^{n} x_{jm} + f = 0 \\
& 0 \le x_{ij} \le k_{ij}, \quad \text{all } i, j,
\end{aligned} \tag{15}
$$


where only those i, j pairs corresponding to arcs are allowed.

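The problem can be expressed more compactly in terms of the node-arc
incidence matrix. Let x be the vector of arc flows $x_{ij}$ (ordered in any way). Let
A be the corresponding node-arc incidence matrix. Finally, let e be a vector with
dimension equal to the number of nodes and having a +1 component on node 1, a
−1 on node m, and all other components zero. The maximal flow problem is then


$$
\begin{aligned}
\text{maximize}\quad & f \\
\text{subject to}\quad & \mathbf{A}\mathbf{x} - f\mathbf{e} = \mathbf{0} \\
& \mathbf{0} \le \mathbf{x} \le \mathbf{k},
\end{aligned} \tag{16}
$$

where k is the vector of arc capacities $k_{ij}$, ordered in the same way as x.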


The coefficient matrix of this problem is equal to the node-arc incidence matrix with
an additional column for the flow variable f. Any basis of this matrix is triangular, 
and hence as indicated by the theory in the earlier part of this chapter, the simplex 
method can be effectively employed to solve this problem. However, instead of 
the simplex method, a more efficient algorithm based on the tree algorithm can 
be used. 
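To make the matrix form concrete, the following sketch builds a node-arc incidence matrix for a small network and checks that a given flow satisfies the conservation equations Ax − fe = 0 together with the capacity bounds. The network, the flow values, and the function name are hypothetical and chosen only for illustration.

```python
def incidence_matrix(nodes, arcs):
    """Rows correspond to nodes and columns to arcs; the column for arc
    (i, j) has +1 in the row of node i and -1 in the row of node j."""
    row = {node: r for r, node in enumerate(nodes)}
    A = [[0] * len(arcs) for _ in nodes]
    for col, (i, j) in enumerate(arcs):
        A[row[i]][col] = 1
        A[row[j]][col] = -1
    return A


# Hypothetical four-node network with source 1 and sink 4.
nodes = [1, 2, 3, 4]
arcs = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
k = [4, 3, 1, 2, 2]        # arc capacities, in the same order as arcs
x = [2, 2, 0, 2, 2]        # a candidate flow
f = 4                      # its value
e = [1, 0, 0, -1]          # +1 at the source, -1 at the sink

A = incidence_matrix(nodes, arcs)
# Component-wise value of Ax - f e; all zeros means conservation holds.
residual = [sum(a * xi for a, xi in zip(A_row, x)) - f * ei
            for A_row, ei in zip(A, e)]
within_bounds = all(0 <= xi <= ki for xi, ki in zip(x, k))
print(residual, within_bounds)
```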

The basic strategy of the algorithm is quite simple. First we recognize that it is 
possible to send nonzero flow from node 1 to node m only if node m is reachable
from node 1. The tree procedure of the previous section can be used to determine
if m is in fact reachable; and if it is reachable, the algorithm will produce a path 
from 1 to m. By examining the arcs along this path, we can determine the one 
with minimum capacity. We may then construct a flow equal to this capacity from 
1 to m by using this path. This gives us a strictly positive (and integer-valued) 
initial flow. 

Next consider the nature of the network at this point in terms of additional 
flows that might be assigned. If there is already flow $x_{ij}$ in the arc (i, j), then the
effective capacity of that arc is reduced by $x_{ij}$ (to $k_{ij} - x_{ij}$), since that is the maximal
amount of additional flow that can be assigned to that arc. On the other hand, the
effective reverse capacity, on the arc (j, i), is increased by $x_{ij}$ (to $k_{ji} + x_{ij}$), since a
small incremental backward flow is actually realized as a reduction in the forward
flow through that arc. Once these changes in capacities have been made, the tree
procedure can again be used to find a path from node 1 to node m on which to
assign additional flow. (Such a path is termed an augmenting path.) Finally, if m 
is not reachable from 1, no additional flow can be assigned, and the procedure is 
complete. 

It is seen that the method outlined above is based on repeated application of 
the tree procedure, which is implemented by labeling and scanning. By including 
slightly more information in the labels than in the basic tree algorithm, the minimum 
arc capacity of the augmenting path can be determined during the initial scanning, 
instead of by reexamining the arcs after the path is found. A typical label at a node 
i has the form $(k, c_i)$, where k denotes a precursor node and $c_i$ is the maximal flow
that can be sent from the source to node i through the path created by the previous 
labeling and scanning. The complete procedure is this: 


Step 0. Set all $x_{ij} = 0$ and f = 0.


Step 1. Label node 1 with $(-, \infty)$. All other nodes are unlabeled.


Step 2. Select any labeled node i for scanning. Say it has label $(k, c_i)$. For all
unlabeled nodes j such that (i, j) is an arc with $x_{ij} < k_{ij}$, assign the label $(i, c_j)$,
where $c_j = \min\{c_i, k_{ij} - x_{ij}\}$. For all unlabeled nodes j such that (j, i) is an arc
with $x_{ji} > 0$, assign the label $(i, c_j)$, where $c_j = \min\{c_i, x_{ji}\}$.


Step 3. Repeat Step 2 until either node m is labeled or until no more labels can
be assigned. In this latter case, the current solution is optimal.


Step 4. (Augmentation.) If the node m is labeled $(i, c_m)$, then increase f and the
flow on arc (i, m) by $c_m$. Continue to work backward along the augmenting path
determined by the nodes, increasing the flow on each arc of the path by $c_m$ (for an
arc (j, i) that was used through its reverse capacity, this means reducing the existing
flow $x_{ji}$ by $c_m$). Return to Step 1.
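A compact sketch of Steps 0 through 4 is given below. It is one possible realization, under these assumptions: capacities are stored in a dictionary keyed by arc, nodes are scanned in first-labeled, first-scanned order, and each label also records the arc used and whether it was used forward or in reverse. The function name and the eight-node example network are hypothetical.

```python
from collections import deque
import math


def maximal_flow(capacity, source, sink):
    """Labeling algorithm. Returns (flow value, arc flows, final labeled set)."""
    flow = {arc: 0 for arc in capacity}                  # Step 0
    value = 0
    while True:
        # Label entries: (precursor, c, arc used, +1 forward / -1 reverse).
        label = {source: (None, math.inf, None, 0)}      # Step 1
        queue = deque([source])
        while queue and sink not in label:               # Steps 2 and 3
            i = queue.popleft()
            c_i = label[i][1]
            for (u, v), k_uv in capacity.items():
                x_uv = flow[(u, v)]
                if u == i and v not in label and x_uv < k_uv:
                    label[v] = (i, min(c_i, k_uv - x_uv), (u, v), 1)
                    queue.append(v)
                elif v == i and u not in label and x_uv > 0:
                    label[u] = (i, min(c_i, x_uv), (u, v), -1)
                    queue.append(u)
        if sink not in label:                            # no augmenting path
            return value, flow, set(label)
        c_m = label[sink][1]                             # Step 4
        node = sink
        while node != source:
            precursor, _, arc, sign = label[node]
            flow[arc] += sign * c_m                      # reverse use cancels flow
            node = precursor
        value += c_m


# Hypothetical network with unit capacities, source 1 and sink 8. The second
# augmenting path must cancel the flow first placed on arc (2, 3).
capacity = {(1, 2): 1, (2, 3): 1, (3, 8): 1, (1, 4): 1, (4, 5): 1,
            (5, 3): 1, (2, 6): 1, (6, 7): 1, (7, 8): 1}
value, flow, labeled = maximal_flow(capacity, 1, 8)
print(value, flow[(2, 3)], labeled)
```

In this example the first augmentation uses the path 1, 2, 3, 8; the second labels node 2 from node 3 through the reverse capacity of arc (2, 3) and removes the unit of flow from that arc.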




The validity of the algorithm should be fairly apparent. However, a complete 
proof is deferred until we consider the max flow—min cut theorem below. Never- 
theless, the finiteness of the algorithm is easily established. 


Proposition. The maximal flow algorithm converges in at most a finite number 
of iterations. 


Proof. (Recall our assumption that all capacities are nonnegative integers.) Clearly, 
the flow is bounded—at least by the sum of the capacities. Starting with zero flow, 
the minimal available capacity at every stage will be an integer, and accordingly, 
the flow will be augmented by an integer amount at every step. This process must 
terminate in a finite number of steps, since the flow is bounded. ∎


Fig. 6.6 Example of maximal flow problem


Example. An example of the above procedure is shown in Fig. 6.6. Node 1 is the
source, and node 6 is the sink. The original network with capacities indicated on the
arcs is shown in Fig. 6.6(a). Also shown in that figure are the initial labels obtained
by the procedure. In this case the sink node is labeled, indicating that a flow of 1
unit can be achieved. The augmenting path of this flow is shown in Fig. 6.6(b).
Numbers in square boxes indicate the total flow in an arc. The new labels are then
found and added to that figure. Note that node 2 cannot be labeled from node 1
because there is no unused capacity in that direction. Node 2 can, however, be
labeled from node 4, since the existing flow provides a reverse capacity of 1 unit.
Again the sink is labeled, and 1 unit more flow can be constructed. The augmenting
path is shown in Fig. 6.6(c). A new labeling is appended to that figure. Again the
sink is labeled, and an additional 1 unit of flow can be sent from source to sink.
The path of this 1 unit is shown in Fig. 6.6(d). Note that it includes a flow from
node 4 to node 2, even though flow was not allowed in this direction in the original
network. This flow is allowable now, however, because there is already flow in the
opposite direction. The total flow at this point is shown in Fig. 6.6(e). The flow
levels are again in square boxes. This flow is maximal, since only the source node
can be labeled.


The efficiency of the maximal flow algorithm can be improved by various 
refinements. For example, a considerable gain in efficiency can be obtained by 
applying the tree algorithm in first-labeled, first-scanned mode. Further discussion 
of these points can be found in the references cited at the end of the chapter. 


Max Flow-Min Cut Theorem


A great deal of insight and some further results can be obtained through the
introduction of the notion of cuts in a network. Given a network with source node
1 and sink node m, divide the nodes arbitrarily into two sets S and $\bar{S}$ such that the
source node is in S and the sink is in $\bar{S}$. The set of arcs from S to $\bar{S}$ is a cut and is
denoted $(S, \bar{S})$. The capacity of the cut is the sum of the capacities of the arcs in
the cut.


Fig. 6.7 A cut

An example of a cut is shown in Fig. 6.7. The set S consists of nodes 1 and 2,
while $\bar{S}$ consists of 3, 4, 5, 6. The capacity of this cut is 4.

It should be clear that a path from node 1 to node m must include at least
one arc in any cut, for the path must have an arc from the set S to the set $\bar{S}$.
Furthermore, it is clear that the maximal amount of flow that can be sent through
a cut is equal to its capacity. Thus each cut gives an upper bound on the value of
the maximal flow problem. The max flow-min cut theorem states that equality is
actually achieved for some cut. That is, the maximal flow is equal to the minimal
cut capacity. It should be noted that the proof of the theorem also establishes the
maximality of the flow obtained by the maximal flow algorithm.


Max Flow-Min Cut Theorem. In a network the maximal flow between a source 
and a sink is equal to the minimal cut capacity of all cuts separating the source 
and sink. 


Proof. Since any cut capacity must be greater than or equal to the maximal flow, 
it is only necessary to exhibit a flow and a cut for which equality is achieved. 
Begin with a flow in the network that cannot be augmented by the maximal flow 
algorithm. For this flow find the effective arc capacities of all arcs for incremental 
flow changes as described earlier and apply the labeling procedure of the maximal 
flow algorithm. Since no augmenting path exists, the algorithm must terminate 
before the sink is labeled. 

Let S and $\bar{S}$ consist of all labeled and unlabeled nodes, respectively. This
defines a cut separating the source from the sink. All arcs originating in S and
terminating in $\bar{S}$ have zero incremental capacity, or else a node in $\bar{S}$ could have
been labeled. This means that each arc in the cut is saturated by the original flow;
that is, the flow is equal to the capacity. Any arc originating in $\bar{S}$ and terminating in
S, on the other hand, must have zero flow; otherwise, this would imply a positive
incremental capacity in the reverse direction, and the originating node in $\bar{S}$ would
be labeled. Thus, there is a total flow from S to $\bar{S}$ equal to the cut capacity, and zero
flow from $\bar{S}$ to S. This means that the flow from source to sink is equal to the cut
capacity. Thus the cut capacity must be minimal, and the flow must be maximal. ∎
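The construction used in the proof can be illustrated on a small case. The sketch below takes a flow that is assumed to be maximal, applies the labeling rule once to obtain the labeled set S, and compares the capacity of the resulting cut with the flow value. The network, the flow, and the function names are hypothetical.

```python
def labeled_set(capacity, flow, source):
    """Nodes reachable from the source through arcs with unused forward
    capacity or through arcs whose existing flow can be reduced."""
    labeled = {source}
    frontier = [source]
    while frontier:
        i = frontier.pop()
        for (u, v), k_uv in capacity.items():
            if u == i and v not in labeled and flow[(u, v)] < k_uv:
                labeled.add(v)
                frontier.append(v)
            elif v == i and u not in labeled and flow[(u, v)] > 0:
                labeled.add(u)
                frontier.append(u)
    return labeled


def cut_capacity(capacity, S):
    """Sum of the capacities of the arcs leading from S to its complement."""
    return sum(k for (u, v), k in capacity.items() if u in S and v not in S)


# Hypothetical network with source 1 and sink 4, and a flow of value 4.
capacity = {(1, 2): 4, (1, 3): 3, (2, 3): 1, (2, 4): 2, (3, 4): 2, (4, 3): 5}
flow = {(1, 2): 2, (1, 3): 2, (2, 3): 0, (2, 4): 2, (3, 4): 2, (4, 3): 0}

S = labeled_set(capacity, flow, source=1)
flow_value = flow[(1, 2)] + flow[(1, 3)]     # net outflow of the source
print(S, cut_capacity(capacity, S), flow_value)
```

The sink is not labeled, the arcs from S to the unlabeled nodes are saturated, the arc (4, 3) leading back into S carries no flow, and the cut capacity equals the flow value, as the proof asserts.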


In the network of Fig. 6.6, the minimal cut corresponds to the S consisting 
only of the source. That cut capacity is 3. Note that in accordance with the max 
flow—min cut theorem, this is equal to the value of the maximal flow, and the 
minimal cut is determined by the final labeling in Fig. 6.6(e). In Fig. 6.7 the cut 
shown is also minimal, and the reader should easily be able to determine the pattern 
of maximal flow. 


Duality 


The character of the max flow—min cut theorem suggests a connection with the 
Duality Theorem. We conclude this section by briefly exploring this connection. 

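The maximal flow problem is a linear program, which is expressed formally
by (16). The dual problem is found to be

$$
\begin{aligned}
\text{minimize}\quad & \mathbf{w}^T \mathbf{k} \\
\text{subject to}\quad & \mathbf{u}^T \mathbf{A} \le \mathbf{w}^T \\
& \mathbf{u}^T \mathbf{e} = 1 \\
& \mathbf{w} \ge \mathbf{0}.
\end{aligned} \tag{17}
$$

The inequality in the first constraint corresponds to the nonnegativity requirement
on the arc flows in (15). When written out in detail, the dual is

$$
\begin{aligned}
\text{minimize}\quad & \sum_{ij} w_{ij} k_{ij} \\
\text{subject to}\quad & u_i - u_j \le w_{ij} \\
& u_1 - u_m = 1 \\
& w_{ij} \ge 0.
\end{aligned} \tag{18}
$$

A pair i, j is included in the above only if (i, j) is an arc of the network.
A feasible solution to this dual problem can be found in terms of any cut set
$(S, \bar{S})$. In particular, it is easily seen that

$$
u_i = \begin{cases} 1 & \text{if } i \in S \\ 0 & \text{if } i \in \bar{S} \end{cases}
\qquad
w_{ij} = \begin{cases} 1 & \text{if } (i, j) \in (S, \bar{S}) \\ 0 & \text{otherwise} \end{cases}
\tag{19}
$$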


is a feasible solution. The value of the dual problem corresponding to this solution 
is the cut capacity. If we take the cut set to be the one determined by the labeling 
procedure of the maximal flow algorithm as described in the proof of the theorem 
above, it can be seen to be optimal by verifying the complementary slackness 
conditions (a task we leave to the reader). The minimum value of the dual is 
therefore equal to the minimum cut capacity. 
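The dual solution built from a cut can be checked directly on a small example. The sketch below assumes the dual constraints in the inequality form u_i − u_j ≤ w_ij, which is the form under which the cut solution remains feasible when arcs lead from the unlabeled side back into S; the network and the function name are hypothetical.

```python
def cut_dual_solution(capacity, S):
    """Dual variables defined by a cut: u_i = 1 on S and 0 elsewhere,
    w_ij = 1 on arcs leading from S to its complement and 0 elsewhere."""
    nodes = {node for arc in capacity for node in arc}
    u = {i: (1 if i in S else 0) for i in nodes}
    w = {(i, j): (1 if (i in S and j not in S) else 0) for (i, j) in capacity}
    return u, w


# Hypothetical network with source 1 and sink 4; S is the source side of a cut.
capacity = {(1, 2): 4, (1, 3): 3, (2, 3): 1, (2, 4): 2, (3, 4): 2, (4, 3): 5}
S = {1, 2, 3}
u, w = cut_dual_solution(capacity, S)

feasible = (
    all(u[i] - u[j] <= w[(i, j)] for (i, j) in capacity)  # arc constraints
    and u[1] - u[4] == 1                                   # source and sink
    and all(w_ij >= 0 for w_ij in w.values())
)
dual_value = sum(w[arc] * capacity[arc] for arc in capacity)
print(feasible, dual_value)
```

The dual objective value equals the capacity of the cut, here 2 + 2 = 4, which agrees with the statement that each cut gives a feasible dual solution whose value is the cut capacity.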


6.9 SUMMARY 


Problems of special structure are important both for applications and for theory. 
The transportation problem represents an important class of linear programs with 
structural properties that lead to an efficient implementation of the simplex method. 
The most important property of the transportation problem is that any basis is 
triangular. This means that the basic variables can be found, one by one, directly 
by back substitution, and the basis need never be inverted. Likewise, the simplex 
multipliers can be found by back substitution, since they solve a set of equations 
involving the transpose of the basis. 

Since all elements of the basis are either zero or one, it follows that all basic
variables will be integers if the requirements are integers, and all simplex multipliers
will be integers if the cost coefficients are integers. When a new variable with
a value $\theta$ is to be brought into the basis, the change in all other basic variables
will be either $+\theta$, $-\theta$, or 0, again because of the structural properties of the basis.
This leads to a cycle of change, which amounts to shipping an amount $\theta$ of the
commodity around a cycle on the transportation system. All necessary computations 
for solution of the transportation problem can be carried out on arrays of solutions 
or of cost coefficients. The primary operations are row and column scanning, which 
implement the back substitution process. 

The assignment problem is a case of the transportation problem with additional 
structure. Every solution is highly degenerate, having only n positive values instead 
of the 2n − 1 that would appear in a nondegenerate solution.

Network flow problems represent another important class of linear 
programming problems. The transportation problem can be generalized to a 
minimum cost flow problem in a network. This leads to the interpretation of a 
simplex basis as corresponding to a spanning tree in the network. 

Another fundamental network problem is that of determining whether it is 
possible to construct a path of arcs to a specified destination node from a given 
origin node. This problem can be efficiently solved using the tree algorithm. This 
algorithm progresses by fanning out from the origin, first determining all nodes 
reachable in one step, then all nodes reachable in one step from these, and so forth 
until the specified destination is attained or it is not possible to continue. 

The maximal flow problem is that of determining the maximal flow from an 
origin to a destination in a network with capacity constraints on the flow in each 
arc. This problem can be solved by repeated application of the tree algorithm, 
successively determining paths from origin to destination and assigning flow along 
such paths. 


6.10 EXERCISES 


1. Using the Northwest Corner Rule, find basic feasible solutions to transportation problems 
with the following requirements: 


a) a=(10,15,7,8) b=(8,6,9, 12,5) 
b) a=(2,3,4,5,6) b=(6,5,4,3,2) 
c) a= (2,4,3,1,5,2) b=(6,4,2, 3,2) 


2. Transform the following to lower triangular form, or show that such transformation is
not possible.

3. A matrix A is said to be totally unimodular if the determinant of every square submatrix
formed from it has value 0, +1, or −1.


a) Show that the matrix A defining the equality constraints of a transportation problem 
is totally unimodular. 

b) In the system of equations Ax = b, assume that A is totally unimodular and that 
all elements of A and b are integers. Show that all basic solutions have integer 
components. 


4. For the arrays below: 


a) Compute the basic solutions indicated. (Note: They may be infeasible.) 
b) Write the equations for the basic variables, corresponding to the indicated basic 
solutions, in lower triangular form. 


x x 10 x x 10 

20 x 20 

x x | 30 x x | 30 
20 | 20 | 20 20 | 20 | 20 


5. For the arrays of cost coefficients below, the circled positions indicate basic variables. 


a) Compute the simplex multipliers. 
b) Write the equations for the simplex multipliers in upper triangular form, and compare 
with Part (b) of Exercise 4. 


3 © @ @6 @ 
2 @ 3 
@ 5 @ 1 © ® 


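6. Consider the modified transportation problem where there is more available at origins
than is required at destinations:

$$
\begin{aligned}
\text{minimize}\quad & \sum_{j=1}^{n} \sum_{i=1}^{m} c_{ij} x_{ij} \\
\text{subject to}\quad & \sum_{j=1}^{n} x_{ij} \le a_i, \quad i = 1, 2, \ldots, m \\
& \sum_{i=1}^{m} x_{ij} = b_j, \quad j = 1, 2, \ldots, n \\
& x_{ij} \ge 0, \quad \text{all } i, j,
\end{aligned}
$$

where $\sum_{i=1}^{m} a_i > \sum_{j=1}^{n} b_j$.
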


a) Show how to convert it to an ordinary transportation problem. 
b) Suppose there is a storage cost of s; per unit at origin i for goods not transported to 
a destination. Repeat Part (a) with this assumption. 


7. Solve the following transportation problem, which is an original example of Hitchcock. 


10 5 6 7 
C= 8 2 7 6 
9 3 4 8 


a=(25 25 50) 
b=(15 20 30 35) 


8. In a transportation problem, suppose that two rows or two columns of the cost coefficient
array differ by a constant. Show that the problem can be reduced by combining those 
rows or columns. 


9. The transportation problem is often solved more quickly by carefully selecting the
starting basic feasible solution. The matrix minimum technique for finding a starting
solution is: (1) Find the lowest cost unallocated cell in the array, and allocate the
maximum possible to it. (2) Reduce the corresponding row and column requirements,
and drop the row or column having zero remaining requirement. Go back to Step 1
unless all remaining requirements are zero.


a) Show that this procedure yields a basic feasible solution. 
b) Apply the method to Exercise 7. 


10. The caterer problem. A caterer is booked to cater a banquet each evening for the next T
days. He requires $r_t$ clean napkins on the tth day for t = 1, 2, ..., T. He may send dirty
napkins to the laundry, which has two speeds of service, fast and slow. The napkins
sent to the fast service will be ready for the next day's banquet; those sent to the slow
service will be ready for the banquet two days later. Fast and slow service cost $c_1$ and
$c_2$ per napkin, respectively, with $c_1 > c_2$. The caterer may also purchase new napkins
at any time at cost $c_0$. He has an initial stock of s napkins and wishes to minimize the
total cost of supplying fresh napkins.


a) Formulate the problem as a transportation problem. (Hint: Use T + 1 sources and T
destinations.)

b) Using the values T = 4, s = 200, $r_1$ = 100, $r_2$ = 130, $r_3$ = 150, $r_4$ = 140, $c_1$ = 6, $c_2$ =
4, $c_0$ = 12, solve the problem.


11. The marriage problem. A group of n men and n women live on an island. The amount of
happiness that the ith man and the jth woman derive by spending a fraction $x_{ij}$ of their
lives together is $c_{ij} x_{ij}$. What is the nature of the living arrangements that maximizes the
total happiness of the islanders?


12. Shortest route problem. Consider a system of n points with distance $c_{ij}$ between points
i and j. We wish to find the shortest path from point 1 to point n.


a) Show how to formulate the problem as an n node minimal cost flow problem.
b) Show how to convert the problem to an equivalent assignment problem of dimension
n − 1.


13. Transshipment I. The general minimal cost flow problem of Section 6.7 can be converted
to a transportation problem and thus solved by the transportation algorithm. One way to 
do this conversion is to find the minimum cost path from every supply node to every 
demand node, allowing for possible shipping through intermediate transshipment nodes. 
The values of these minimum costs become the effective point-to-point costs in the 
equivalent transportation problem. Once the transportation problem is solved, yielding 
amounts to be shipped from origins to destinations, the result is translated back to flows 
in arcs by shipping along the previously determined minimal cost paths. 

Consider the transshipment problem with five shipping points defined by the 
symmetric cost matrix and the requirements indicated below 


s = (10, 30, 0, −20, −20)

In this system points 1 and 2 are net suppliers, points 4 and 5 are net demanders, and
point 3 is neither. Any of the points may serve as transshipment points. That is, it is not
necessary to ship directly from one node to another; any path is allowable.

a) Show that the above problem is equivalent to the transportation problem defined by
the arrays below, and solve this problem.


[Transportation array: origins 1 and 2 (rows) with supplies a = (10, 30); destinations 4 and 5
(columns) with demands b = (20, 20). The cost entries are not recoverable from the source.]


b) Find the optimal flows in the original network. 


14. Transshipment II. Another way to convert a transshipment problem to a transportation 
problem is through the introduction of buffer stocks at each node. A transshipment 
can then be replaced by a series of direct shipments, where the buffer stocks from 
intermediate points are shipped ahead but then replenished when other shipments arrive. 

Suppose the original problem had n nodes with supply values $b_i$, i = 1, 2, ..., n,
with $\sum_i b_i = 0$. In the equivalent problem there are n origin nodes with supply B and n
destination nodes with value $B + b_i$. B is the buffer level (sufficiently large).

Using this method and B = 40, the problem in Exercise 13 can be formulated as a 
5 x 5 transportation problem with supplies (40, 40, 40, 40, 40) and demands (50, 70, 
40, 20, 20). Solve this problem. Throw away all diagonal terms (which represent buffer 
changes) to obtain the solution of the original problem. 


15. Transshipment III. Solve the problem of Exercise 13 using the method of Section 6.7. 


16. Apply the maximal flow algorithm to the network below. All arcs have capacity 1 unless
otherwise indicated.


Chapter 7 BASIC PROPERTIES
OF SOLUTIONS
AND ALGORITHMS


In this chapter we consider optimization problems of the form 


$$
\begin{aligned}
\text{minimize}\quad & f(\mathbf{x}) \\
\text{subject to}\quad & \mathbf{x} \in \Omega,
\end{aligned} \tag{1}
$$

where f is a real-valued function and $\Omega$, the feasible set, is a subset of $E^n$.
Throughout most of the chapter attention is restricted to the case where $\Omega = E^n$,
corresponding to the completely unconstrained case, but sometimes we consider
cases where $\Omega$ is some particularly simple subset of $E^n$.

The first and third sections of the chapter characterize the first- and second- 
order conditions that must hold at a solution point of (1). These conditions are 
simply extensions to $E^n$ of the well-known derivative conditions for a function of
a single variable that hold at a maximum or a minimum point. The fourth and 
fifth sections of the chapter introduce the important classes of convex and concave 
functions that provide zeroth-order conditions as well as a natural formulation for a 
global theory of optimization and provide geometric interpretations of the derivative 
conditions derived in the first and third sections.

The final sections of the chapter are devoted to basic convergence charac- 
teristics of algorithms. Although this material is not exclusively applicable to 
optimization problems but applies to general iterative algorithms for solving 
other problems as well, it can be regarded as a fundamental prerequisite for a 
modern treatment of optimization techniques. Two essential questions are addressed 
concerning iterative algorithms. The first question, which is qualitative in nature, is 
whether a given algorithm in some sense yields, at least in the limit, a solution to the 
original problem. This question is treated in Section 7.6, and conditions sufficient to 
guarantee appropriate convergence are established. The second question, the more 
quantitative one, is related to how fast the algorithm converges to a solution. This 
question is defined more precisely in Section 7.7. Several special types of conver- 
gence, which arise frequently in the development of algorithms for optimization, 
are explored. 


7.1 FIRST-ORDER NECESSARY CONDITIONS


Perhaps the first question that arises in the study of the minimization problem 
(1) is whether a solution exists. The main result that can be used to address 
this issue is the theorem of Weierstrass, which states that if f is continuous and
$\Omega$ is compact, a solution exists (see Appendix A.6). This is a valuable result
that should be kept in mind throughout our development; however, our primary 
concern is with characterizing solution points and devising effective methods for 
finding them. 

In an investigation of the general problem (1) we distinguish two kinds of 
solution points: local minimum points, and global minimum points. 


Definition. A point $x^* \in \Omega$ is said to be a relative minimum point or a local
minimum point of f over $\Omega$ if there is an $\varepsilon > 0$ such that $f(x) \geq f(x^*)$ for all
$x \in \Omega$ within a distance $\varepsilon$ of $x^*$ (that is, $x \in \Omega$ and $|x - x^*| < \varepsilon$). If $f(x) > f(x^*)$
for all $x \in \Omega$, $x \neq x^*$, within a distance $\varepsilon$ of $x^*$, then $x^*$ is said to be a strict
relative minimum point of f over $\Omega$.


Definition. A point $x^* \in \Omega$ is said to be a global minimum point of f over
$\Omega$ if $f(x) \geq f(x^*)$ for all $x \in \Omega$. If $f(x) > f(x^*)$ for all $x \in \Omega$, $x \neq x^*$, then $x^*$
is said to be a strict global minimum point of f over $\Omega$.


In formulating and attacking problem (1) we are, by definition, explicitly asking 
for a global minimum point of f over the set $\Omega$. Practical reality, however, both
from the theoretical and computational viewpoint, dictates that we must in many 
circumstances be content with a relative minimum point. In deriving necessary 
conditions based on the differential calculus, for instance, or when searching for 
the minimum point by a convergent stepwise procedure, comparison of the values
of nearby points is all that is possible, and attention focuses on relative minimum
points. Global conditions and global solutions can, as a rule, only be found if the 
problem possesses certain convexity properties that essentially guarantee that any 
relative minimum is a global minimum. Thus, in formulating and attacking problem 
(1) we shall, by the dictates of practicality, usually consider, implicitly, that we are 
asking for a relative minimum point. If appropriate conditions hold, this will also 
be a global minimum point. 


Feasible Directions 


To derive necessary conditions satisfied by a relative minimum point x*, the basic 
idea is to consider movement away from the point in some given direction. Along 
any given direction the objective function can be regarded as a function of a single 
variable, the parameter defining movement in this direction, and hence the ordinary 
calculus of a single variable is applicable. Thus given $x \in \Omega$, we are motivated to say
that a vector d is a feasible direction at x if there is an $\bar{\alpha} > 0$ such that $x + \alpha d \in \Omega$
for all $\alpha$, $0 \leq \alpha \leq \bar{\alpha}$. With this simple concept we can state some simple conditions
satisfied by relative minimum points. 


Proposition 1 (First-order necessary conditions). Let $\Omega$ be a subset of $E^n$ and
let $f \in C^1$ be a function on $\Omega$. If $x^*$ is a relative minimum point of f over $\Omega$,
then for any $d \in E^n$ that is a feasible direction at $x^*$, we have $\nabla f(x^*) d \geq 0$.


Proof. For any $\alpha$, $0 \leq \alpha \leq \bar{\alpha}$, the point $x(\alpha) = x^* + \alpha d \in \Omega$. For $0 \leq \alpha \leq \bar{\alpha}$ define
the function $g(\alpha) = f(x(\alpha))$. Then g has a relative minimum at $\alpha = 0$. A typical g
is shown in Fig. 7.1. By the ordinary calculus we have

$$g(\alpha) - g(0) = g'(0)\alpha + o(\alpha), \tag{2}$$

where $o(\alpha)$ denotes terms that go to zero faster than $\alpha$ (see Appendix A). If
$g'(0) < 0$ then, for sufficiently small values of $\alpha > 0$, the right side of (2) will be
negative, and hence $g(\alpha) - g(0) < 0$, which contradicts the minimal nature of $g(0)$.
Thus $g'(0) = \nabla f(x^*) d \geq 0$. ∎


A very important special case is where $x^*$ is in the interior of $\Omega$ (as would be
the case if $\Omega = E^n$). In this case there are feasible directions emanating in every
direction from $x^*$, and hence $\nabla f(x^*) d \geq 0$ for all $d \in E^n$. This implies $\nabla f(x^*) = 0$.
We state this important result as a corollary. 


Corollary. (Unconstrained case). Let $\Omega$ be a subset of $E^n$, and let $f \in C^1$ be
a function on $\Omega$. If $x^*$ is a relative minimum point of f over $\Omega$ and if $x^*$ is an
interior point of $\Omega$, then $\nabla f(x^*) = 0$.


The necessary conditions in the pure unconstrained case lead to n equations 
(one for each component of $\nabla f$) in n unknowns (the components of $x^*$), which
in many cases can be solved to determine the solution. In practice, however, as 
demonstrated in the following chapters, an optimization problem is solved directly 
without explicitly attempting to solve the equations arising from the necessary 
conditions. Nevertheless, these conditions form a foundation for the theory. 


Fig. 7.1 Construction for proof (graph of $g(\alpha)$ against $\alpha$, with nonnegative slope at $\alpha = 0$)


Example 1. Consider the problem

$$\text{minimize } f(x_1, x_2) = x_1^2 - x_1 x_2 + x_2^2 - 3x_2.$$

There are no constraints, so $\Omega = E^2$. Setting the partial derivatives of f equal to
zero yields the two equations

$$
\begin{aligned}
2x_1 - x_2 &= 0 \\
-x_1 + 2x_2 &= 3.
\end{aligned}
$$

These have the unique solution $x_1 = 1$, $x_2 = 2$, which is a global minimum point of f.
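
As a quick numerical check of Example 1, the following minimal sketch solves the two stationarity equations and confirms that the gradient vanishes at the result. It reads the objective as $f(x_1, x_2) = x_1^2 - x_1 x_2 + x_2^2 - 3x_2$; the helper names `f`, `grad_f`, and `solve_2x2` are illustrative choices, not part of the text.

```python
def f(x1, x2):
    """Objective of Example 1 (as read from the text)."""
    return x1**2 - x1 * x2 + x2**2 - 3.0 * x2


def grad_f(x1, x2):
    """Analytic gradient of f: (2*x1 - x2, -x1 + 2*x2 - 3)."""
    return (2.0 * x1 - x2, -x1 + 2.0 * x2 - 3.0)


def solve_2x2(a11, a12, a21, a22, r1, r2):
    """Solve a 2x2 linear system by Cramer's rule (assumes a nonzero determinant)."""
    det = a11 * a22 - a12 * a21
    return ((r1 * a22 - a12 * r2) / det, (a11 * r2 - a21 * r1) / det)


# Stationarity equations: 2*x1 - x2 = 0 and -x1 + 2*x2 = 3.
x_star = solve_2x2(2.0, -1.0, -1.0, 2.0, 0.0, 3.0)
print(x_star, grad_f(*x_star), f(*x_star))
```

Because f is a quadratic with a positive definite Hessian, the single stationary point found this way is the global minimum claimed in the example.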


Example 2. Consider the problem

$$
\begin{aligned}
&\text{minimize} && f(x_1, x_2) = x_1^2 - x_1 + x_2 + x_1 x_2 \\
&\text{subject to} && x_1 \geq 0, \quad x_2 \geq 0.
\end{aligned}
$$

This problem has a global minimum at $x_1 = \tfrac{1}{2}$, $x_2 = 0$. At this point

$$\frac{\partial f}{\partial x_1} = 2x_1 - 1 + x_2 = 0, \qquad \frac{\partial f}{\partial x_2} = 1 + x_1 = \frac{3}{2}.$$

Thus, the partial derivatives do not both vanish at the solution, but since any
feasible direction must have an $x_2$ component greater than or equal to zero, we have
$\nabla f(x^*) d \geq 0$ for all $d \in E^2$ such that d is a feasible direction at the point (1/2, 0).
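
The boundary case can also be checked numerically. The sketch below assumes the objective of Example 2 is $f(x_1, x_2) = x_1^2 - x_1 + x_2 + x_1 x_2$ (the form consistent with the stated minimizer (1/2, 0)), samples unit directions around the circle, keeps the feasible ones, and evaluates $\nabla f(x^*) d$ for each. The function names and the sampling scheme are illustrative assumptions.

```python
import math


def grad_f(x1, x2):
    """Gradient of f(x1, x2) = x1**2 - x1 + x2 + x1*x2."""
    return (2.0 * x1 - 1.0 + x2, 1.0 + x1)


def is_feasible_direction(x, d, tol=1e-12):
    """Feasible at x for the set x1 >= 0, x2 >= 0: a component of d may be
    negative only where the matching component of x is strictly positive."""
    return all(xi > tol or di >= 0.0 for xi, di in zip(x, d))


x_star = (0.5, 0.0)
g = grad_f(*x_star)

# Sample unit directions and record the directional derivative for feasible ones.
slopes = []
for k in range(360):
    theta = math.radians(k)
    d = (math.cos(theta), math.sin(theta))
    if is_feasible_direction(x_star, d):
        slopes.append(g[0] * d[0] + g[1] * d[1])

print(g, min(slopes))
```

Every sampled feasible direction has a nonnegative directional derivative, even though the gradient itself is not zero.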


7.2 EXAMPLES OF UNCONSTRAINED PROBLEMS 


Unconstrained optimization problems occur in a variety of contexts, but most 
frequently when the problem formulation is simple. More complex formula- 
tions often involve explicit functional constraints. However, many problems with 
constraints are frequently converted to unconstrained problems by using the 
constraints to establish relations among variables, thereby reducing the effective 
number of variables. We present a few examples here that should begin to indicate 
the wide scope to which the theory applies. 


Example 1 (Production). A common problem in economic theory is the determination
of the best way to combine various inputs in order to produce a certain
commodity. There is a known production function $f(x_1, x_2, \ldots, x_n)$ that gives the
amount of the commodity produced as a function of the amounts $x_i$ of the inputs,
$i = 1, 2, \ldots, n$. The unit price of the produced commodity is q, and the unit prices
of the inputs are $p_1, p_2, \ldots, p_n$. The producer wishing to maximize profit must
solve the problem

$$\text{maximize } q f(x_1, x_2, \ldots, x_n) - p_1 x_1 - p_2 x_2 - \cdots - p_n x_n.$$




The first-order necessary conditions are that the partial derivatives with respect
to the $x_i$'s each vanish. This leads directly to the n equations

$$q \frac{\partial f}{\partial x_i}(x_1, x_2, \ldots, x_n) = p_i, \qquad i = 1, 2, \ldots, n.$$

These equations can be interpreted as stating that, at the solution, the marginal
value due to a small increase in the ith input must be equal to the price $p_i$.


Example 2 (Approximation). A common use of optimization is for the purpose
of function approximation. Suppose, for example, that through an experiment
the value of a function g is observed at m points, $x_1, x_2, \ldots, x_m$. Thus, values
$g(x_1), g(x_2), \ldots, g(x_m)$ are known. We wish to approximate the function by a
polynomial

$$h(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_0$$

of degree n (or less), where n < m. Corresponding to any choice of the approximating
polynomial, there will be a set of errors $\varepsilon_k = g(x_k) - h(x_k)$. We define the best
approximation as the polynomial that minimizes the sum of the squares of these errors; that
is, minimizes

$$\sum_{k=1}^{m} (\varepsilon_k)^2.$$

This in turn means that we minimize

$$f(a) = \sum_{k=1}^{m} \left[ g(x_k) - \left( a_n x_k^n + a_{n-1} x_k^{n-1} + \cdots + a_0 \right) \right]^2$$

with respect to $a = (a_0, a_1, \ldots, a_n)$ to find the best coefficients. This is a quadratic
expression in the coefficients a. To find a compact representation for this objective
we define, for $i, j = 0, 1, \ldots, n$,

$$q_{ij} = \sum_{k=1}^{m} (x_k)^{i+j}, \qquad b_j = \sum_{k=1}^{m} g(x_k)(x_k)^j, \qquad c = \sum_{k=1}^{m} g(x_k)^2.$$

Then after a bit of algebra it can be shown that

$$f(a) = a^T Q a - 2 b^T a + c$$

where $Q = [q_{ij}]$ and $b = (b_0, b_1, \ldots, b_n)$.
The first-order necessary conditions state that the gradient of f must vanish. This
leads directly to the system of n + 1 equations

$$Q a = b.$$

These can be solved to determine a.
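
The normal equations $Qa = b$ of the approximation example can be assembled and solved directly. The following is a minimal sketch using only the Python standard library; the data are made up (sampled exactly from $1 + 2x + 3x^2$ so the fit can be checked by eye), and `normal_equations` and `solve` are hypothetical helper names. The indices i, j run from 0 to n, matching $a = (a_0, \ldots, a_n)$.

```python
def normal_equations(xs, gs, n):
    """Build Q and b for a degree-n least-squares polynomial fit.

    Q[i][j] = sum_k xs[k]**(i + j) and b[j] = sum_k gs[k] * xs[k]**j,
    with i, j = 0, ..., n.
    """
    Q = [[sum(x ** (i + j) for x in xs) for j in range(n + 1)] for i in range(n + 1)]
    b = [sum(g * x ** j for x, g in zip(xs, gs)) for j in range(n + 1)]
    return Q, b


def solve(A, rhs):
    """Gaussian elimination with partial pivoting; returns the solution as a list."""
    m = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented copy of the system
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= factor * M[col][c]
    sol = [0.0] * m
    for r in reversed(range(m)):
        s = sum(M[r][c] * sol[c] for c in range(r + 1, m))
        sol[r] = (M[r][m] - s) / M[r][r]
    return sol


# Made-up observations taken exactly from g(x) = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
gs = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
Q, b = normal_equations(xs, gs, n=2)
a = solve(Q, b)
print(a)  # coefficients in the order (a_0, a_1, a_2)
```

For larger n the matrix Q becomes badly conditioned, so in practice a QR-based least-squares routine is usually preferred over forming the normal equations explicitly; the sketch is meant only to mirror the derivation.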


Example 3 (Selection problem). It is often necessary to select an assortment of
factors to meet a given set of requirements. An example is the problem faced by 
an electric utility when selecting its power-generating facilities. The level of power 
that the company must supply varies by time of the day, by day of the week, and 
by season. Its power-generating requirements are summarized by a curve, h(x), as 
shown in Fig. 7.2(a), which shows the total hours in a year that a power level of at 
least x is required for each x. For convenience the curve is normalized so that the 
upper limit is unity. 

The power company may meet these requirements by installing generating 
equipment, such as (1) nuclear or (2) coal-fired, or by purchasing power from a 
central energy grid. Associated with type i (i = 1, 2) of generating equipment is
a yearly unit capital cost $b_i$ and a unit operating cost $c_i$. The unit price of power
purchased from the grid is $c_3$.

Nuclear plants have a high capital cost and low operating cost, so they are 
used to supply a base load. Coal-fired plants are used for the intermediate level, 
and power is purchased directly only for peak demand periods. The requirements 
are satisfied as shown in Fig. 7.2(b), where $x_1$ and $x_2$ denote the capacities of the
nuclear and coal-fired plants, respectively. (For example, the nuclear power plant
can be visualized as consisting of $x_1/\Delta$ small generators of capacity $\Delta$, where $\Delta$ is
small. The first such generator is on for about $h(\Delta)$ hours, supplying $\Delta h(\Delta)$ units
of energy; the next supplies $\Delta h(2\Delta)$ units, and so forth. The total energy supplied
by the nuclear plant is thus the area shown.)

The total cost is

$$f(x_1, x_2) = b_1 x_1 + b_2 x_2 + c_1 \int_0^{x_1} h(x)\,dx + c_2 \int_{x_1}^{x_1 + x_2} h(x)\,dx + c_3 \int_{x_1 + x_2}^{1} h(x)\,dx,$$

and the company wishes to minimize this over the set defined by

$$x_1 \geq 0, \quad x_2 \geq 0, \quad x_1 + x_2 \leq 1.$$

Fig. 7.2 Power requirements curve


Assuming that the solution is interior to the constraints, by setting the partial
derivatives equal to zero, we obtain the two equations

$$
\begin{aligned}
b_1 + (c_1 - c_2) h(x_1) + (c_2 - c_3) h(x_1 + x_2) &= 0 \\
b_2 + (c_2 - c_3) h(x_1 + x_2) &= 0,
\end{aligned}
$$

which represent the necessary conditions.

If $x_1 = 0$, then the general necessary condition theorem shows that the first
equality could relax to $\geq 0$. Likewise, if $x_2 = 0$, then the second equality could
relax to $\geq 0$. The case $x_1 + x_2 = 1$ requires a bit more analysis (see Exercise 2).
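
To see how the two necessary conditions of the selection problem pin down the capacities, consider a purely hypothetical instance. The sketch assumes a linear requirements curve $h(x) = 1 - x$ (with time normalized as well) and made-up cost figures; none of these numbers come from the text. Subtracting the second condition from the first isolates $h(x_1)$, so each condition can be inverted separately.

```python
def h(x):
    """Assumed requirements curve (hypothetical): h(x) = 1 - x on [0, 1]."""
    return 1.0 - x


def h_inverse(t):
    """Inverse of the assumed curve h."""
    return 1.0 - t


# Hypothetical data: capital costs b1 > b2 and operating costs c1 < c2 < c3.
b1, b2 = 1.0, 0.4
c1, c2, c3 = 1.0, 2.0, 4.0

# Second condition: b2 + (c2 - c3) * h(x1 + x2) = 0.
total_capacity = h_inverse(b2 / (c3 - c2))
# First minus second condition: (b1 - b2) + (c1 - c2) * h(x1) = 0.
x1 = h_inverse((b1 - b2) / (c2 - c1))
x2 = total_capacity - x1

residual_1 = b1 + (c1 - c2) * h(x1) + (c2 - c3) * h(x1 + x2)
residual_2 = b2 + (c2 - c3) * h(x1 + x2)
print(x1, x2, residual_1, residual_2)
```

With these numbers the solution is interior ($x_1 > 0$, $x_2 > 0$, $x_1 + x_2 < 1$), so the equality form of the conditions is the appropriate one.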


Example 4 (Control). Dynamic problems, where the variables correspond to 
actions taken at a sequence of time instants, can often be formulated as unconstrained 
optimization problems. As an example suppose that the position of a large object is 
controlled by a series of corrective control forces. The error in position (the distance 
from the desired position) is governed by the equation 


$$x_{k+1} = x_k + u_k,$$

where $x_k$ is the error at time instant k, and $u_k$ is the effective force applied at time
instant k (after being normalized to account for the mass of the object and the duration of
the force). The value of $x_0$ is given. The sequence $u_0, u_1, \ldots, u_n$ should be selected
so as to minimize the objective

$$J = \sum_{k=0}^{n} \{ x_k^2 + u_k^2 \}.$$

This represents a compromise between a desire to have $x_k$ equal to zero and
recognition that control action $u_k$ is costly.

The problem can be converted to an unconstrained problem by eliminating the
$x_k$ variables, $k = 1, 2, \ldots, n$, from the objective. It is readily seen that

$$x_k = x_0 + u_0 + u_1 + \cdots + u_{k-1}.$$

The objective can therefore be rewritten as

$$J = \sum_{k=0}^{n} \{ (x_0 + u_0 + \cdots + u_{k-1})^2 + u_k^2 \}.$$

This is a quadratic function in the unknowns $u_k$. It has the same general structure
as that of Example 2 and it can be treated in a similar way.
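
The elimination step in the control example can be made concrete. The sketch below reads the dynamics as $x_{k+1} = x_k + u_k$ and the objective as $J = \sum_{k=0}^{n} (x_k^2 + u_k^2)$, expresses J in terms of the controls alone, and minimizes it by plain gradient descent with a conservative fixed step. Gradient descent is an illustrative choice here, not the method of the text, and all function names are hypothetical.

```python
def states(x0, u):
    """Return [x_0, ..., x_n] with x_k = x0 + u_0 + ... + u_{k-1}."""
    xs = [x0]
    for uk in u[:-1]:
        xs.append(xs[-1] + uk)
    return xs


def objective(x0, u):
    """J = sum over k of x_k**2 + u_k**2, written in terms of u only."""
    return sum(xk**2 + uk**2 for xk, uk in zip(states(x0, u), u))


def gradient(x0, u):
    """dJ/du_j = 2*u_j + 2*(x_{j+1} + ... + x_n), since u_j enters every later state."""
    xs = states(x0, u)
    return [2.0 * u[j] + 2.0 * sum(xs[j + 1:]) for j in range(len(u))]


def minimize_control(x0, n, iterations=2000):
    """Fixed-step gradient descent on the unconstrained quadratic J(u)."""
    u = [0.0] * (n + 1)
    # 2 + n*(n + 1) bounds the largest Hessian eigenvalue from above,
    # so this step size is conservative.
    step = 1.0 / (2.0 + n * (n + 1))
    for _ in range(iterations):
        g = gradient(x0, u)
        u = [uj - step * gj for uj, gj in zip(u, g)]
    return u


u_short = minimize_control(1.0, 1)
u_long = minimize_control(1.0, 3)
print(u_short, u_long, objective(1.0, u_long))
```

For n = 1 the minimizer can be found by hand ($u_0 = -x_0/2$, $u_1 = 0$), which gives a simple check on the sketch. Because J is quadratic, the same answer could be obtained by solving one linear system, as in Example 2.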


7.3 SECOND-ORDER CONDITIONS


The proof of Proposition 1 in Section 7.1 is based on making a first-order approx- 
imation to the function f in the neighborhood of the relative minimum point. 
Additional conditions can be obtained by considering higher-order approximations. 
The second-order conditions, which are defined in terms of the Hessian matrix $\nabla^2 f$
of second partial derivatives of f (see Appendix A), are of extreme theoretical 
importance and dominate much of the analysis presented in later chapters. 


Proposition 1 (Second-order necessary conditions). Let $\Omega$ be a subset of $E^n$
and let $f \in C^2$ be a function on $\Omega$. If $x^*$ is a relative minimum point of f over
$\Omega$, then for any $d \in E^n$ that is a feasible direction at $x^*$ we have

i) $\nabla f(x^*) d \geq 0$   (3)
ii) if $\nabla f(x^*) d = 0$, then $d^T \nabla^2 f(x^*) d \geq 0$.   (4)


Proof. The first condition is just Proposition 1 of Section 7.1, and the second applies only if
$\nabla f(x^*) d = 0$. In this case, introducing $x(\alpha) = x^* + \alpha d$ and $g(\alpha) = f(x(\alpha))$ as
before, we have, in view of $g'(0) = 0$,

$$g(\alpha) - g(0) = \tfrac{1}{2} g''(0) \alpha^2 + o(\alpha^2).$$

If $g''(0) < 0$ the right side of the above equation is negative for sufficiently small
$\alpha$, which contradicts the relative minimum nature of $g(0)$. Thus

$$g''(0) = d^T \nabla^2 f(x^*) d \geq 0. \; ∎$$


Example 1. For the same problem as Example 2 of Section 7.1, we have for
$d = (d_1, d_2)$

$$\nabla f(x^*) d = \tfrac{3}{2} d_2.$$

Thus condition (ii) of Proposition 1 applies only if $d_2 = 0$. In that case we have
$d^T \nabla^2 f(x^*) d = 2 d_1^2 \geq 0$, so condition (ii) is satisfied.

Again of special interest is the case where the minimizing point is an interior 
point of $\Omega$, as, for example, in the case of completely unconstrained problems. We
then obtain the following classical result. 


Proposition 2 (Second-order necessary conditions, unconstrained case).
Let $x^*$ be an interior point of the set $\Omega$, and suppose $x^*$ is a relative minimum
point over $\Omega$ of the function $f \in C^2$. Then

i) $\nabla f(x^*) = 0$   (5)
ii) for all d, $d^T \nabla^2 f(x^*) d \geq 0$.   (6)

For notational simplicity we often denote $\nabla^2 f(x)$, the n × n matrix of the second
partial derivatives of f, the Hessian of f, by the alternative notation F(x). Condition 
(ii) is equivalent to stating that the matrix F(x*) is positive semidefinite. As we 
shall see, the matrix F(x*), which arises here quite naturally in a discussion of 
necessary conditions, plays a fundamental role in the analysis of iterative methods 
for solving unconstrained optimization problems. The structure of this matrix is the 
primary determinant of the rate of convergence of algorithms designed to minimize 
the function f. 


Example 2. Consider the problem

$$
\begin{aligned}
&\text{minimize} && f(x_1, x_2) = x_1^3 - x_1^2 x_2 + 2x_2^2 \\
&\text{subject to} && x_1 \geq 0, \quad x_2 \geq 0.
\end{aligned}
$$

If we assume that the solution is in the interior of the feasible set, that is, if
$x_1 > 0$, $x_2 > 0$, then the first-order necessary conditions are

$$3x_1^2 - 2x_1 x_2 = 0, \qquad -x_1^2 + 4x_2 = 0.$$

There is a solution to these at $x_1 = x_2 = 0$ which is a boundary point, but there is
also a solution at $x_1 = 6$, $x_2 = 9$. We note that for $x_1$ fixed at $x_1 = 6$, the objective
attains a relative minimum with respect to $x_2$ at $x_2 = 9$. Conversely, with $x_2$ fixed
at $x_2 = 9$, the objective attains a relative minimum with respect to $x_1$ at $x_1 = 6$.
Despite this fact, the point $x_1 = 6$, $x_2 = 9$ is not a relative minimum point, because
the Hessian matrix is

$$F = \begin{bmatrix} 6x_1 - 2x_2 & -2x_1 \\ -2x_1 & 4 \end{bmatrix},$$

which, evaluated at the proposed solution $x_1 = 6$, $x_2 = 9$, is

$$F = \begin{bmatrix} 18 & -12 \\ -12 & 4 \end{bmatrix}.$$

This matrix is not positive semidefinite, since its determinant is negative. Thus the
proposed solution is not a relative minimum point.
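
The failure of the second-order condition at (6, 9) can be confirmed by computing the eigenvalues of the Hessian. The sketch assumes the objective $f(x_1, x_2) = x_1^3 - x_1^2 x_2 + 2x_2^2$ and uses the closed-form eigenvalues of a symmetric 2 × 2 matrix; the helper names are illustrative.

```python
import math


def grad(x1, x2):
    """Gradient of f(x1, x2) = x1**3 - x1**2 * x2 + 2 * x2**2."""
    return (3.0 * x1**2 - 2.0 * x1 * x2, -x1**2 + 4.0 * x2)


def hessian(x1, x2):
    """Hessian of f as a tuple of rows."""
    return ((6.0 * x1 - 2.0 * x2, -2.0 * x1), (-2.0 * x1, 4.0))


def eigenvalues_sym_2x2(F):
    """Eigenvalues (smaller, larger) of a symmetric 2x2 matrix."""
    (a, b), (_, d) = F
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return (mean - radius, mean + radius)


F = hessian(6.0, 9.0)
low, high = eigenvalues_sym_2x2(F)
print(grad(6.0, 9.0), F, (low, high))
```

The gradient vanishes at (6, 9), but one eigenvalue of F is negative, so the objective decreases along the corresponding eigenvector and the point is a saddle rather than a relative minimum.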


Sufficient Conditions for a Relative Minimum 


By slightly strengthening the second condition of Proposition 2 above, we obtain a 
set of conditions that imply that the point x* is a relative minimum. We give here 
the conditions that apply only to unconstrained problems, or to problems where the 
minimum point is interior to the feasible region, since the corresponding conditions 
for problems where the minimum is achieved on a boundary point of the feasible 
set are a good deal more difficult and of marginal practical or theoretical value. A 
more general result, applicable to problems with functional constraints, is given in 
Chapter 11. 


Proposition 3 (Second-order sufficient conditions, unconstrained case).
Let $f \in C^2$ be a function defined on a region in which the point $x^*$ is an interior
point. Suppose in addition that

i) $\nabla f(x^*) = 0$   (7)
ii) $F(x^*)$ is positive definite.   (8)

Then $x^*$ is a strict relative minimum point of f.

Proof. Since $F(x^*)$ is positive definite, there is an $a > 0$ such that for all
d, $d^T F(x^*) d \geq a|d|^2$. Thus by Taylor's Theorem (with remainder)

$$
\begin{aligned}
f(x^* + d) - f(x^*) &= \tfrac{1}{2} d^T F(x^*) d + o(|d|^2) \\
&\geq (a/2)|d|^2 + o(|d|^2).
\end{aligned}
$$

For small $|d|$ the first term on the right dominates the second, implying that both
sides are positive for small d. ∎
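
Condition (ii) of Proposition 3 can be tested numerically in several ways. One common approach, sketched below under the assumption that the matrix is symmetric, is to attempt a Cholesky factorization, which succeeds with strictly positive pivots exactly when the matrix is positive definite. The function name and the tolerance are illustrative choices.

```python
import math


def is_positive_definite(F, tol=1e-12):
    """Attempt a Cholesky factorization F = L L^T of a symmetric matrix.

    Returns True if every pivot exceeds tol, and False as soon as one does not.
    """
    n = len(F)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = F[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= tol:
                    return False
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


# Hessian of Example 1 in Section 7.1 (constant, since f is quadratic).
print(is_positive_definite([[2.0, -1.0], [-1.0, 2.0]]))
# Hessian of Example 2 in this section, evaluated at (6, 9).
print(is_positive_definite([[18.0, -12.0], [-12.0, 4.0]]))
```

The first matrix passes, which together with a vanishing gradient certifies a strict relative minimum; the second fails, consistent with the discussion of Example 2. A matrix that is only positive semidefinite also fails this test, which matches the gap between the necessary and the sufficient conditions.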


7.4 CONVEX AND CONCAVE FUNCTIONS 


In order to develop a theory directed toward characterizing global, rather than local, 
minimum points, it is necessary to introduce some sort of convexity assumptions. 
This not only results in a more potent, although more restrictive, theory but also
provides an interesting geometric interpretation of the second-order sufficiency
result derived above. 


Definition. A function f defined on a convex set $\Omega$ is said to be convex if,
for every $x_1, x_2 \in \Omega$ and every $\alpha$, $0 \leq \alpha \leq 1$, there holds

$$f(\alpha x_1 + (1 - \alpha) x_2) \leq \alpha f(x_1) + (1 - \alpha) f(x_2).$$

If, for every $\alpha$, $0 < \alpha < 1$, and $x_1 \neq x_2$, there holds

$$f(\alpha x_1 + (1 - \alpha) x_2) < \alpha f(x_1) + (1 - \alpha) f(x_2),$$

then f is said to be strictly convex.


Several examples of convex or nonconvex functions are shown in Fig. 7.3. 
Geometrically, a function is convex if the line joining two points on its graph lies 
nowhere below the graph, as shown in Fig. 7.3(a), or, thinking of a function in two 
dimensions, it is convex if its graph is bowl shaped. 

Next we turn to the definition of a concave function. 


Definition. A function g defined on a convex set $\Omega$ is said to be concave
if the function f = —g is convex. The function g is strictly concave if —g is 
strictly convex. 


Fig. 7.3 Convex and nonconvex functions


Combinations of Convex Functions


We show that convex functions can be combined to yield new convex functions 
and that convex functions when used as constraints yield convex constraint sets. 


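Proposition 1. Let $f_1$ and $f_2$ be convex functions on the convex set $\Omega$. Then
the function $f_1 + f_2$ is convex on $\Omega$.


Proof. Let $x_1, x_2 \in \Omega$, and $0 < \alpha < 1$. Then

$$
\begin{aligned}
& f_1(\alpha x_1 + (1 - \alpha) x_2) + f_2(\alpha x_1 + (1 - \alpha) x_2) \\
& \qquad \leq \alpha [f_1(x_1) + f_2(x_1)] + (1 - \alpha)[f_1(x_2) + f_2(x_2)]. \; ∎
\end{aligned}
$$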


Proposition 2. Let f be a convex function over the convex set $\Omega$. Then the
function $af$ is convex for any $a \geq 0$.


Proof. Immediate. ∎

Note that through repeated application of the above two propositions it follows
that a positive combination $a_1 f_1 + a_2 f_2 + \cdots + a_m f_m$ of convex functions is again
convex.
Finally, we consider sets defined by convex inequality constraints. 


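Proposition 3. Let f be a convex function on a convex set $\Omega$. The set
$\Gamma_c = \{x : x \in \Omega, f(x) \leq c\}$ is convex for every real number c.


Proof. Let $x_1, x_2 \in \Gamma_c$. Then $f(x_1) \leq c$, $f(x_2) \leq c$ and for $0 < \alpha < 1$,

$$f(\alpha x_1 + (1 - \alpha) x_2) \leq \alpha f(x_1) + (1 - \alpha) f(x_2) \leq c.$$

Thus $\alpha x_1 + (1 - \alpha) x_2 \in \Gamma_c$. ∎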


We note that, since the intersection of convex sets is also convex, the set of 
points simultaneously satisfying 


$$f_1(x) \leq c_1, \quad f_2(x) \leq c_2, \quad \ldots, \quad f_m(x) \leq c_m,$$

where each $f_i$ is a convex function, defines a convex set. This is important in
mathematical programming, since the constraint set is often defined this way. 


Properties of Differentiable Convex Functions 


If a function f is differentiable, then there are alternative characterizations of 
convexity. 


Proposition 4. Let $f \in C^1$. Then f is convex over a convex set $\Omega$ if and only if

$$f(y) \geq f(x) + \nabla f(x)(y - x) \tag{9}$$

for all $x, y \in \Omega$.


Proof. First suppose f is convex. Then for all $\alpha$, $0 \leq \alpha \leq 1$,

$$f(\alpha y + (1 - \alpha) x) \leq \alpha f(y) + (1 - \alpha) f(x).$$

Thus for $0 < \alpha \leq 1$

$$\frac{f(x + \alpha (y - x)) - f(x)}{\alpha} \leq f(y) - f(x).$$

Letting $\alpha \to 0$ we obtain

$$\nabla f(x)(y - x) \leq f(y) - f(x).$$

This proves the "only if" part.
Now assume

$$f(y) \geq f(x) + \nabla f(x)(y - x)$$

for all $x, y \in \Omega$. Fix $x_1, x_2 \in \Omega$ and $\alpha$, $0 \leq \alpha \leq 1$. Setting $x = \alpha x_1 + (1 - \alpha) x_2$ and
alternatively $y = x_1$ or $y = x_2$, we have

$$f(x_1) \geq f(x) + \nabla f(x)(x_1 - x) \tag{10}$$
$$f(x_2) \geq f(x) + \nabla f(x)(x_2 - x). \tag{11}$$

Multiplying (10) by $\alpha$ and (11) by $(1 - \alpha)$ and adding, we obtain

$$\alpha f(x_1) + (1 - \alpha) f(x_2) \geq f(x) + \nabla f(x)[\alpha x_1 + (1 - \alpha) x_2 - x].$$

But substituting $x = \alpha x_1 + (1 - \alpha) x_2$, we obtain

$$\alpha f(x_1) + (1 - \alpha) f(x_2) \geq f(\alpha x_1 + (1 - \alpha) x_2). \; ∎$$


The statement of the above proposition is illustrated in Fig. 7.4. It can be 
regarded as a sort of dual characterization of the original definition illustrated in 
Fig. 7.3. The original definition essentially states that linear interpolation between 
two points overestimates the function, while the above proposition states that linear 
approximation based on the local derivative underestimates the function. 
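
The gradient inequality of Proposition 4 lends itself to a direct numerical check. The sketch below evaluates the gap $f(y) - [f(x) + \nabla f(x)(y - x)]$ on a small grid for the convex quadratic of Example 1 in Section 7.1, and at one pair of points for the nonconvex cubic of Example 2 in Section 7.3. The function names, the grid, and the chosen pair are illustrative assumptions.

```python
def first_order_gap(f, grad, x, y):
    """f(y) - [f(x) + grad f(x) (y - x)]; nonnegative for all x, y when f is convex."""
    g = grad(x)
    return f(y) - f(x) - sum(gi * (yi - xi) for gi, xi, yi in zip(g, x, y))


def convex_f(x):
    return x[0] ** 2 - x[0] * x[1] + x[1] ** 2 - 3.0 * x[1]


def convex_grad(x):
    return (2.0 * x[0] - x[1], -x[0] + 2.0 * x[1] - 3.0)


def nonconvex_f(x):
    return x[0] ** 3 - x[0] ** 2 * x[1] + 2.0 * x[1] ** 2


def nonconvex_grad(x):
    return (3.0 * x[0] ** 2 - 2.0 * x[0] * x[1], -x[0] ** 2 + 4.0 * x[1])


# All pairs of points on a 7 x 7 integer grid.
grid = [(float(i), float(j)) for i in range(-3, 4) for j in range(-3, 4)]
min_gap_convex = min(
    first_order_gap(convex_f, convex_grad, x, y) for x in grid for y in grid
)

# One pair that violates the inequality for the nonconvex function.
gap_nonconvex = first_order_gap(nonconvex_f, nonconvex_grad, (6.0, 9.0), (7.0, 10.5))
print(min_gap_convex, gap_nonconvex)
```

A finite sample can only refute convexity, never prove it; the nonnegative gaps for the quadratic are consistent with, but do not by themselves establish, the inequality (9).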

For twice continuously differentiable functions, there is another characterization 
of convexity. 


Fig. 7.4 Illustration of Proposition 4 (the graph of $f(y)$ lies above the line $f(x) + \nabla f(x)(y - x)$)


Proposition 5. Let $f \in C^2$. Then f is convex over a convex set $\Omega$ containing
an interior point if and only if the Hessian matrix F of f is positive semidefinite
throughout $\Omega$.


Proof. By Taylor's theorem we have

$$f(y) = f(x) + \nabla f(x)(y - x) + \tfrac{1}{2}(y - x)^T F(x + \alpha(y - x))(y - x) \tag{12}$$

for some $\alpha$, $0 \leq \alpha \leq 1$. Clearly, if the Hessian is everywhere positive semidefinite,
we have

$$f(y) \geq f(x) + \nabla f(x)(y - x), \tag{13}$$

which in view of Proposition 4 implies that f is convex.

Now suppose the Hessian is not positive semidefinite at some point $x \in \Omega$. By
continuity of the Hessian it can be assumed, without loss of generality, that x is an
interior point of $\Omega$. There is a $y \in \Omega$ such that $(y - x)^T F(x)(y - x) < 0$. Again by
the continuity of the Hessian, y may be selected so that for all $\alpha$, $0 \leq \alpha \leq 1$,

$$(y - x)^T F(x + \alpha(y - x))(y - x) < 0.$$

This in view of (12) implies that (13) does not hold, which in view of Proposition 4
implies that f is not convex. ∎


The Hessian matrix is the generalization to $E^n$ of the concept of the curvature
of a function, and correspondingly, positive definiteness of the Hessian is the 
generalization of positive curvature. Convex functions have positive (or at least 
nonnegative) curvature in every direction. Motivated by these observations, we 
sometimes refer to a function as being locally convex if its Hessian matrix is positive 
semidefinite in a small region, and locally strictly convex if the Hessian is positive 
definite in the region. In these terms we see that the second-order sufficiency result 
of the last section requires that the function be locally strictly convex at the point
x*. Thus, even the local theory, derived solely in terms of the elementary calculus, 
is actually intimately related to convexity—at least locally. For this reason we can 
view the two theories, local and global, not as disjoint parallel developments but 
as complementary and interactive. Results that are based on convexity apply even 
to nonconvex problems in a region near the solution, and conversely, local results 
apply to a global minimum point. 


7.5 MINIMIZATION AND MAXIMIZATION 
OF CONVEX FUNCTIONS 


We turn now to the three classic results concerning minimization or maximization 
of convex functions. 


Theorem 1. Let f be a convex function defined on the convex set $\Omega$. Then
the set $\Gamma$ where f achieves its minimum is convex, and any relative minimum
of f is a global minimum.


Proof. If f has no relative minima the theorem is valid by default. Assume now
that $c_0$ is the minimum of f. Then clearly $\Gamma = \{x : f(x) \leq c_0, x \in \Omega\}$ and this is
convex by Proposition 3 of the last section.

Suppose now that $x^* \in \Omega$ is a relative minimum point of f, but that there
is another point $y \in \Omega$ with $f(y) < f(x^*)$. On the line $\alpha y + (1 - \alpha) x^*$, $0 < \alpha < 1$
we have

$$f(\alpha y + (1 - \alpha) x^*) \leq \alpha f(y) + (1 - \alpha) f(x^*) < f(x^*),$$

contradicting the fact that $x^*$ is a relative minimum point. ∎


We might paraphrase the above theorem as saying that for convex functions, all 
minimum points are located together (in a convex set) and all relative minima are 
global minima. The next theorem says that if f is continuously differentiable and 
convex, then satisfaction of the first-order necessary conditions is both necessary
and sufficient for a point to be a global minimizing point. 


Theorem 2. Let $f \in C^1$ be convex on the convex set $\Omega$. If there is a point
$x^* \in \Omega$ such that, for all $y \in \Omega$, $\nabla f(x^*)(y - x^*) \geq 0$, then $x^*$ is a global minimum
point of f over $\Omega$.


Proof. We note parenthetically that since $y - x^*$ is a feasible direction at $x^*$,
the given condition is equivalent to the first-order necessary condition stated in
Section 7.1. The proof of the theorem is immediate, since by Proposition 4 of
the last section

$$f(y) \geq f(x^*) + \nabla f(x^*)(y - x^*) \geq f(x^*). \; ∎$$


Next we turn to the question of maximizing a convex function over a convex 
set. There is, however, no analog of Theorem 1 for maximization; indeed, the 
tendency is for the occurrence of numerous nonglobal relative maximum points.
Nevertheless, it is possible to prove one important result. It is not used in subsequent 
chapters, but it is useful for some areas of optimization. 


Theorem 3. Let f be a convex function defined on the bounded, closed convex
set $\Omega$. If f has a maximum over $\Omega$, it is achieved at an extreme point of $\Omega$.


Proof. Suppose f achieves a global maximum at $x^* \in \Omega$. We show first that this
maximum is achieved at some boundary point of $\Omega$. If $x^*$ is itself a boundary point,
then there is nothing to prove, so assume $x^*$ is not a boundary point. Let L be any
line passing through the point $x^*$. The intersection of this line with $\Omega$ is an interval
of the line L having end points $y_1$, $y_2$ which are boundary points of $\Omega$, and we have
$x^* = \alpha y_1 + (1 - \alpha) y_2$ for some $\alpha$, $0 < \alpha < 1$. By convexity of f

$$f(x^*) \leq \alpha f(y_1) + (1 - \alpha) f(y_2) \leq \max\{f(y_1), f(y_2)\}.$$

Thus either $f(y_1)$ or $f(y_2)$ must be at least as great as $f(x^*)$. Since $x^*$ is a maximum
point, so is either $y_1$ or $y_2$.

We have shown that the maximum, if achieved, must be achieved at a boundary
point of $\Omega$. If this boundary point, $x^*$, is an extreme point of $\Omega$ there is nothing
more to prove. If it is not an extreme point, consider the intersection of $\Omega$ with a
supporting hyperplane H at $x^*$. This intersection, $T_1$, is of dimension n − 1 or less
and the global maximum of f over $T_1$ is equal to $f(x^*)$ and must be achieved at
a boundary point $x_1$ of $T_1$. If this boundary point is an extreme point of $T_1$, it is
also an extreme point of $\Omega$ by Lemma 1, Section B.4, and hence the theorem is
proved. If $x_1$ is not an extreme point of $T_1$, we form $T_2$, the intersection of $T_1$ with a
hyperplane in $E^{n-1}$ supporting $T_1$ at $x_1$. This process can continue at most a total of
n times when a set $T_n$ of dimension zero, consisting of a single point, is obtained.
This single point is an extreme point of $T_n$ and also, by repeated application of
Lemma 1, Section B.4, an extreme point of $\Omega$. ∎
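
Theorem 3 is easy to illustrate on a small case. The sketch below takes the unit square as $\Omega$ and, as an assumed convex function, the squared distance to the point (0.3, 0.6); it compares the largest value over the four vertices with the largest value over a fine grid covering the square. Both the function and the grid resolution are illustrative choices.

```python
def f(x):
    """An assumed convex function: squared distance to the point (0.3, 0.6)."""
    return (x[0] - 0.3) ** 2 + (x[1] - 0.6) ** 2


# Extreme points of the unit square.
vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
best_vertex = max(vertices, key=f)

# A 21 x 21 grid covering the whole square, vertices included.
steps = 20
grid = [(i / steps, j / steps) for i in range(steps + 1) for j in range(steps + 1)]
grid_max = max(f(p) for p in grid)

print(best_vertex, f(best_vertex), grid_max)
```

No grid point does better than the best vertex, as the theorem predicts. For a polytope this reduces maximization of a convex function to a comparison over finitely many extreme points, although their number can grow very quickly with dimension.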


7.6 ZERO-ORDER CONDITIONS 


We have considered the problem 


$$
\begin{aligned}
&\text{minimize} && f(x) \\
&\text{subject to} && x \in \Omega
\end{aligned} \tag{14}
$$

to be unconstrained because there are no functional constraints of the form $g(x) \leq b$
or $h(x) = c$. However, the problem is of course constrained by the set $\Omega$. This
constraint influences the first- and second-order necessary and sufficient conditions 
through the relation between feasible directions and derivatives of the function f. 
Nevertheless, there is a way to treat this constraint without reference to derivatives. 
The resulting conditions are then of zero order. These necessary conditions require 
that the problem be convex in a certain way, while the sufficient conditions require
no assumptions at all. The simplest assumptions for the necessary conditions are
that $\Omega$ is a convex set and that f is a convex function on all of $E^n$.


Fig. 7.5 The epigraph, the tubular region, and the hyperplane


To derive the necessary conditions under these assumptions consider the set
$\Gamma \subset E^{n+1}$, $\Gamma = \{(r, x) : r \geq f(x), x \in E^n\}$. In a figure of the graph of f, the set $\Gamma$ is the
region above the graph, shown in the upper part of Fig. 7.5. This set is called the
epigraph of f. It is easy to verify that the set $\Gamma$ is convex if f is a convex function.

Suppose that $x^* \in \Omega$ is the minimizing point with value $f^* = f(x^*)$. We
construct a tubular region with cross section $\Omega$ and extending vertically from $-\infty$
up to $f^*$, shown as B in the upper part of Fig. 7.5. This is also a convex set, and it
overlaps the set $\Gamma$ only at the boundary point $(f^*, x^*)$ above $x^*$ (or possibly many
boundary points if f is flat near $x^*$).

According to the separating hyperplane theorem (Appendix B), there is a
hyperplane separating these two sets. This hyperplane can be represented by a
nonzero vector of the form $(s, \lambda) \in E^{n+1}$ with s a scalar and $\lambda \in E^n$, and a
separation constant c. The separation conditions are

$$s r + \lambda^T x \geq c \quad \text{for all } x \in E^n \text{ and } r \geq f(x) \tag{15}$$
$$s r + \lambda^T x \leq c \quad \text{for all } x \in \Omega \text{ and } r \leq f^*. \tag{16}$$

It follows that $s \neq 0$; for otherwise $\lambda \neq 0$ and then (15) would be violated for some
$x \in E^n$. It also follows that $s \geq 0$ since otherwise (16) would be violated by very
negative values of r. Hence, together we find $s > 0$ and by appropriate scaling we
may take $s = 1$.

It is easy to see that the above conditions can be expressed alternatively as two 
optimization problems, as stated in the following proposition. 


Proposition 1 (Zero-order necessary conditions). If $\mathbf{x}^*$ solves (14) under the 
stated convexity conditions, then there is a vector $\boldsymbol{\lambda} \in E^n$ such that $\mathbf{x}^*$ 
is a solution to the two problems: 

$$\text{minimize } f(\mathbf{x}) + \boldsymbol{\lambda}^T \mathbf{x} \quad \text{subject to } \mathbf{x} \in E^n \tag{17}$$

and 

$$\text{maximize } \boldsymbol{\lambda}^T \mathbf{x} \quad \text{subject to } \mathbf{x} \in \Omega. \tag{18}$$


Proof. Problem (17) follows from (15) (with $s = 1$) and the fact that $f(\mathbf{x}) \leq r$ 
for $r \geq f(\mathbf{x})$. The value $c$ is attained from above at $(f^*, \mathbf{x}^*)$. Likewise (18) follows 
from (16) and the fact that $\mathbf{x}^*$ and the appropriate $r$ attain $c$ from below. ∎


Notice that problem (17) is completely unconstrained, since $\mathbf{x}$ may range over 
all of $E^n$. The second problem (18) is constrained by $\Omega$, but has a linear objective 
function. 

It is clear from Fig. 7.5 that the slope of the hyperplane is equal to the slope 
of the function f when f is continuously differentiable at the solution x*. 

If the optimal solution $\mathbf{x}^*$ is in the interior of $\Omega$, then the second problem (18) 
implies that $\boldsymbol{\lambda} = \mathbf{0}$, for otherwise there would be a direction of movement from 
$\mathbf{x}^*$ that increases the product $\boldsymbol{\lambda}^T \mathbf{x}$ above $\boldsymbol{\lambda}^T \mathbf{x}^*$. The hyperplane is horizontal in 
that case. The zero-order conditions provide no new information in this situation. 
However, when the solution is on a boundary point of $\Omega$ the conditions give very 
useful information. 


Example 1 (Minimization over an interval). Consider a continuously differentiable 
function $f$ of a single variable $x \in E^1$ defined on the unit interval [0, 1] which 
plays the role of $\Omega$ here. The first problem (17) implies $f'(x^*) = -\lambda$. If the solution 
is at the left end of the interval (at $x = 0$) then the second problem (18) implies 
that $\lambda \leq 0$ which means that $f'(x^*) \geq 0$. The reverse holds if $x^*$ is at the right end. 
These together are identical to the first-order conditions of Section 7.1. 
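
The conditions of Example 1 can be checked numerically on a small case. The sketch below assumes a test function f(x) = (x + 1)^2 on [0, 1], which is not taken from the text; its constrained minimizer is the left endpoint x* = 0, so the multiplier is λ = −f′(0) = −2. The grid search is only a crude stand-in for solving problems (17) and (18).

```python
def f(x):
    # Hypothetical test function, chosen for illustration only.
    return (x + 1.0) ** 2

x_star = 0.0   # minimizer of f over the interval [0, 1]
lam = -2.0     # lambda = -f'(x_star) = -2 * (x_star + 1)

# Problem (17): minimize f(x) + lam * x over the real line (sampled on a grid).
grid_all = [i / 100.0 for i in range(-300, 301)]
x17 = min(grid_all, key=lambda x: f(x) + lam * x)

# Problem (18): maximize lam * x over the interval [0, 1].
grid_omega = [i / 100.0 for i in range(0, 101)]
x18 = max(grid_omega, key=lambda x: lam * x)

print(x17, x18)
```

Both searches return the point x* = 0, and the sign λ ≤ 0 agrees with f′(x*) ≥ 0 at the left endpoint.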


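Example 2. As a generalization of the above example, let $f \in C^1$ on $E^n$, and let $f$ 
have a minimum with respect to $\Omega$ at $\mathbf{x}^*$. Let $\mathbf{d} \in E^n$ be a feasible direction at $\mathbf{x}^*$. 
Then (17) gives $\nabla f(\mathbf{x}^*) = -\boldsymbol{\lambda}^T$, while (18) requires $\boldsymbol{\lambda}^T \mathbf{d} \leq 0$, and it follows 
again that $\nabla f(\mathbf{x}^*)\mathbf{d} \geq 0$. 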


Sufficient Conditions. The conditions of Proposition 1 are sufficient for x* to be 
a minimum even without the convexity assumptions. 


Proposition 2 (Zero-order sufficiency conditions). If there is a $\boldsymbol{\lambda}$ such that 
$\mathbf{x}^* \in \Omega$ solves the problems (17) and (18), then $\mathbf{x}^*$ solves (14). 


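Proof. Suppose $\mathbf{x}_1$ is any other point in $\Omega$. Then from (17) 

$$f(\mathbf{x}_1) + \boldsymbol{\lambda}^T \mathbf{x}_1 \geq f(\mathbf{x}^*) + \boldsymbol{\lambda}^T \mathbf{x}^*.$$

This can be rewritten as 

$$f(\mathbf{x}_1) - f(\mathbf{x}^*) \geq \boldsymbol{\lambda}^T \mathbf{x}^* - \boldsymbol{\lambda}^T \mathbf{x}_1.$$

By problem (18) the right hand side of this is greater than or equal to zero. Hence 
$f(\mathbf{x}_1) - f(\mathbf{x}^*) \geq 0$ which establishes the result. 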


7.7 GLOBAL CONVERGENCE OF DESCENT 
ALGORITHMS 


A good portion of the remainder of this book is devoted to presentation and analysis 
of various algorithms designed to solve nonlinear programming problems. Although 
these algorithms vary substantially in their motivation, application, and detailed 
analysis, ranging from the simple to the highly complex, they have the common 
heritage of all being iterative descent algorithms. By iterative, we mean, roughly, 
that the algorithm generates a series of points, each point being calculated on the 
basis of the points preceding it. By descent, we mean that as each new point is 
generated by the algorithm the corresponding value of some function (evaluated at 
the most recent point) decreases in value. Ideally, the sequence of points generated 
by the algorithm in this way converges in a finite or infinite number of steps to a 
solution of the original problem. 

An iterative algorithm is initiated by specifying a starting point. If for arbitrary 
starting points the algorithm is guaranteed to generate a sequence of points 
converging to a solution, then the algorithm is said to be globally convergent. Quite 
definitely, not all algorithms have this obviously desirable property. Indeed, many of 
the most important algorithms for solving nonlinear programming problems are not 
globally convergent in their purest form and thus occasionally generate sequences 
that either do not converge at all or converge to points that are not solutions. It is 
often possible, however, to modify such algorithms, by appending special devices, 
so as to guarantee global convergence. 

Fortunately, the subject of global convergence can be treated in a unified 
manner through the analysis of a general theory of algorithms developed mainly 
by Zangwill. From this analysis, which is presented in this section, we derive 
the Global Convergence Theorem that is applicable to the study of any iterative 
descent algorithm. Frequent reference to this important result is made in subsequent 
chapters. 


Algorithms 


We think of an algorithm as a mapping. Given a point x in some space X, the 
output of an algorithm applied to x is a new point. Operated iteratively, an algorithm 
is repeatedly reapplied to the new points it generates so as to produce a whole 
sequence of points. Thus, as a preliminary definition, we might formally define 
an algorithm A as a mapping taking points in a space X into (other) points in 
X. Operated iteratively, the algorithm A initiated at $\mathbf{x}_0 \in X$ would generate the 
sequence $\{\mathbf{x}_k\}$ defined by 


$$\mathbf{x}_{k+1} = \mathbf{A}(\mathbf{x}_k).$$

In practice, the mapping A might be defined explicitly by a simple mathematical 
expression or it might be defined implicitly by, say, a lengthy complex computer 
program. Given an input vector, both define a corresponding output. 

With this intuitive idea of an algorithm in mind, we now generalize the concept 
somewhat so as to provide greater flexibility in our analyses. 


Definition. An algorithm A is a mapping defined on a space X that assigns 
to every point $\mathbf{x} \in X$ a subset of X. 


In this definition the term “space” can be interpreted loosely. Usually X is the 
vector space $E^n$ but it may be only a subset of $E^n$ or even a more general metric 
space. The most important aspect of the definition, however, is that the mapping 
A, rather than being a point-to-point mapping of X, is a point-to-set mapping of X. 

An algorithm A generates a sequence of points in the following way. Given 
$\mathbf{x}_k \in X$ the algorithm yields $\mathbf{A}(\mathbf{x}_k)$ which is a subset of X. From this subset an 
arbitrary element $\mathbf{x}_{k+1}$ is selected. In this way, given an initial point $\mathbf{x}_0$, the algorithm 
generates sequences through the iteration 


$$\mathbf{x}_{k+1} \in \mathbf{A}(\mathbf{x}_k).$$


It is clear that, unlike the case where A is a point-to-point mapping, the sequence 
generated by the algorithm A cannot, in general, be predicted solely from knowledge 
of the initial point $\mathbf{x}_0$. This degree of uncertainty is designed to reflect uncertainty 
that we may have in practice as to specific details of an algorithm. 


Example 1. Suppose for $x$ on the real line we define 


$$\mathbf{A}(x) = [-|x|/2,\ |x|/2]$$


so that A(x) is an interval of the real line. Starting at $x_0 = 100$, each of the sequences 
below might be generated from iterative application of this algorithm. 


100, 50, 25, 12, −6, −2, 1, 1/2, ... 
100, −40, 20, −5, −2, 1, 1/4, 1/8, ... 
100, 10, −1, 1/16, 1/100, −1/1000, 1/10,000, ... 
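
The point-to-set map of Example 1 can be simulated by pairing it with different selection rules. The following is a minimal sketch; the three rules, the helper names `run` and `rules`, and the step count are assumptions made for illustration. Whatever rule is used, each iterate has absolute value at most half that of its predecessor.

```python
import random

def A(x):
    """Point-to-set map of Example 1: the interval [-|x|/2, |x|/2] as a (lo, hi) pair."""
    half = abs(x) / 2.0
    return (-half, half)

def run(x0, select, steps=40):
    """Generate x_{k+1} in A(x_k), using `select` to pick one element of the interval."""
    xs = [x0]
    for _ in range(steps):
        lo, hi = A(xs[-1])
        xs.append(select(lo, hi))
    return xs

rng = random.Random(0)  # fixed seed so that the random rule is repeatable
rules = {
    "upper endpoint": lambda lo, hi: hi,
    "lower endpoint": lambda lo, hi: lo,
    "random point": lambda lo, hi: rng.uniform(lo, hi),
}
finals = {name: run(100.0, rule)[-1] for name, rule in rules.items()}
for name, value in finals.items():
    print(name, value)
```

All three runs end within 100/2^40 of zero, which is the sense in which a single analysis covers a whole family of implementations.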


The apparent ambiguity that is built into this definition of an algorithm is not 
meant to imply that actual algorithms are random in character. In actual imple- 
mentation algorithms are not defined ambiguously. Indeed, a particular computer 
program executed twice from the same starting point will generate two copies of the 
same sequence. In other words, in practice algorithms are point-to-point mappings. 
The utility of the more general definition is that it allows one to analyze, in a 
single step, the convergence of an infinite family of similar algorithms. Thus, two 
computer programs, designed from the same basic idea, may differ slightly in some 
details, and therefore perhaps may not produce identical results when given the 
same starting point. Both programs may, however, be regarded as implementations 
of the same point-to-set mappings. In the example above, for instance, it is not 
necessary to know exactly how $x_{k+1}$ is determined from $x_k$ so long as it is known 
that its absolute value is no greater than one-half $x_k$'s absolute value. The result will 
always tend toward zero. In this manner, the generalized concept of an algorithm 
sometimes leads to simpler analysis. 


Descent 


In order to describe the idea of a descent algorithm we first must agree on a subset 
$\Gamma$ of the space X, referred to as the solution set. The basic idea of a descent function, 
which is defined below, is that for points outside the solution set, a single step of 
the algorithm yields a decrease in the value of the descent function. 


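Definition. Let $\Gamma \subset X$ be a given solution set and let A be an algorithm on 
X. A continuous real-valued function Z on X is said to be a descent function 
for $\Gamma$ and A if it satisfies 


i) if $\mathbf{x} \notin \Gamma$ and $\mathbf{y} \in \mathbf{A}(\mathbf{x})$, then $Z(\mathbf{y}) < Z(\mathbf{x})$ 
ii) if $\mathbf{x} \in \Gamma$ and $\mathbf{y} \in \mathbf{A}(\mathbf{x})$, then $Z(\mathbf{y}) \leq Z(\mathbf{x})$. 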


There are a number of ways a solution set, algorithm, and descent function can 
be defined. A natural set-up for the problem 


$$\text{minimize } f(\mathbf{x}) \quad \text{subject to } \mathbf{x} \in \Omega$$

is to let $\Gamma$ be the set of minimizing points, and define an algorithm A on $\Omega$ in 
such a way that $f$ decreases at each step and thereby serves as a descent function. 
Indeed, this is the procedure followed in a majority of cases. Another possibility 
for unconstrained problems is to let $\Gamma$ be the set of points $\mathbf{x}$ satisfying $\nabla f(\mathbf{x}) = \mathbf{0}$. 
In this case we might design an algorithm for which $|\nabla f(\mathbf{x})|$ serves as a descent 
function or for which $f(\mathbf{x})$ serves as a descent function. 
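
As a small illustration of the first set-up, the sketch below takes the objective itself as the descent function Z. The quadratic objective, the fixed step length 0.1, and the starting point are all assumptions chosen for this example; the map `A` is a point-to-point special case of the general definition, not a method prescribed by the text.

```python
def f(x):
    # Hypothetical objective: a strictly convex quadratic with minimizer (0, 0).
    return x[0] ** 2 + 2.0 * x[1] ** 2

def grad_f(x):
    return (2.0 * x[0], 4.0 * x[1])

def A(x, step=0.1):
    """One fixed-step gradient move; `step` is an assumed value, small enough to give descent here."""
    g = grad_f(x)
    return (x[0] - step * g[0], x[1] - step * g[1])

x = (4.0, -3.0)
values = [f(x)]
for _ in range(30):
    x = A(x)
    values.append(f(x))

# Z = f is a descent function along this run: every step strictly lowers it.
strict_descent = all(b < a for a, b in zip(values, values[1:]))
print(strict_descent, values[-1])
```

Here the solution set is the single minimizing point, and the recorded objective values decrease strictly at every step taken outside it.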


Closed Mappings 


An important property possessed by some algorithms is that they are closed. This 
property, which is a generalization for point-to-set mappings of the concept of 
continuity for point-to-point mappings, turns out to be the key to establishing a 
general global convergence theorem. In defining this property we allow the point- 
to-set mapping to map points in one space X into subsets of another space Y. 


Definition. A point-to-set mapping A from X to Y is said to be closed at 
$\mathbf{x} \in X$ if the assumptions 


i) $\mathbf{x}_k \to \mathbf{x}$, $\mathbf{x}_k \in X$, 
ii) $\mathbf{y}_k \to \mathbf{y}$, $\mathbf{y}_k \in \mathbf{A}(\mathbf{x}_k)$ 

imply 

iii) $\mathbf{y} \in \mathbf{A}(\mathbf{x})$. 


Fig. 7.6 Graphs of mappings (panels: closed, not closed) 


The point-to-set map A is said to be closed on X if it is closed at each point of X. 


Example 2. As a special case, suppose that the mapping A is a point-to-point 
mapping; that is, for each $\mathbf{x} \in X$ the set $\mathbf{A}(\mathbf{x})$ consists of a single point in Y. Suppose 
also that A is continuous at $\mathbf{x} \in X$. This means that if $\mathbf{x}_k \to \mathbf{x}$ then $\mathbf{A}(\mathbf{x}_k) \to \mathbf{A}(\mathbf{x})$, 
and it follows that A is closed at $\mathbf{x}$. Thus for point-to-point mappings continuity 
implies closedness. The converse is, however, not true in general. 

The definition of a closed mapping can be visualized in terms of the graph 
of the mapping, which is the set $\{(\mathbf{x}, \mathbf{y}) : \mathbf{x} \in X,\ \mathbf{y} \in \mathbf{A}(\mathbf{x})\}$. If X is closed, then A 
is closed throughout X if and only if this graph is a closed set. This is illustrated 
in Fig. 7.6. However, this equivalence is valid only when considering closedness 
everywhere. In general a mapping may be closed at some points and not at others. 


Example 3. The reader should verify that the point-to-set mapping defined in 
Example 1 is closed. 
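
One way to carry out the verification requested in Example 3 is sketched below. The argument is supplied here for illustration and is not taken from the text; it uses only the definition of closedness and the continuity of the absolute value.

```latex
% Closedness of A(x) = [-|x|/2, |x|/2] at an arbitrary point x of the real line.
\begin{align*}
&\text{Let } x_k \to x,\quad y_k \to y,\quad y_k \in \mathbf{A}(x_k),
  \text{ that is, } |y_k| \le \tfrac{1}{2}|x_k| \text{ for every } k. \\
&\text{Passing to the limit gives }
  |y| = \lim_{k\to\infty} |y_k| \le \lim_{k\to\infty} \tfrac{1}{2}|x_k| = \tfrac{1}{2}|x|. \\
&\text{Hence } y \in [-|x|/2,\ |x|/2] = \mathbf{A}(x), \text{ so } \mathbf{A} \text{ is closed at } x.
\end{align*}
```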


Many complex algorithms that we analyze are most conveniently regarded 
as the composition of two or more simple point-to-set mappings. It is therefore 
natural to ask whether closedness of the individual maps implies closedness of the 
composite. The answer is a qualified “yes.” The technical details of composition 
are described in the remainder of this subsection. They can safely be omitted at 
first reading while proceeding to the Global Convergence Theorem. 


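Definition. Let $\mathbf{A}: X \to Y$ and $\mathbf{B}: Y \to Z$ be point-to-set mappings. The 
composite mapping $\mathbf{C} = \mathbf{BA}$ is defined as the point-to-set mapping $\mathbf{C}: X \to Z$ 
with 


$$\mathbf{C}(\mathbf{x}) = \bigcup_{\mathbf{y} \in \mathbf{A}(\mathbf{x})} \mathbf{B}(\mathbf{y}).$$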


This definition is illustrated in Fig. 7.7. 


Fig. 7.7 Composition of mappings 


Proposition. Let $\mathbf{A}: X \to Y$ and $\mathbf{B}: Y \to Z$ be point-to-set mappings. Suppose 
A is closed at $\mathbf{x}$ and B is closed on $\mathbf{A}(\mathbf{x})$. Suppose also that if $\mathbf{x}_k \to \mathbf{x}$ and 
$\mathbf{y}_k \in \mathbf{A}(\mathbf{x}_k)$, there is a $\mathbf{y}$ such that, for some subsequence $\{\mathbf{y}_{k_i}\}$, $\mathbf{y}_{k_i} \to \mathbf{y}$. Then 
the composite mapping $\mathbf{C} = \mathbf{BA}$ is closed at $\mathbf{x}$. 


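Proof. Let $\mathbf{x}_k \to \mathbf{x}$ and $\mathbf{z}_k \to \mathbf{z}$ with $\mathbf{z}_k \in \mathbf{C}(\mathbf{x}_k)$. It must be shown that $\mathbf{z} \in \mathbf{C}(\mathbf{x})$. 
Select $\mathbf{y}_k \in \mathbf{A}(\mathbf{x}_k)$ such that $\mathbf{z}_k \in \mathbf{B}(\mathbf{y}_k)$ and according to the hypothesis let $\mathbf{y}$ 
and $\{\mathbf{y}_{k_i}\}$ be such that $\mathbf{y}_{k_i} \to \mathbf{y}$. Since A is closed at $\mathbf{x}$ it follows that $\mathbf{y} \in \mathbf{A}(\mathbf{x})$. 
Likewise, since $\mathbf{y}_{k_i} \to \mathbf{y}$, $\mathbf{z}_{k_i} \to \mathbf{z}$ and B is closed at $\mathbf{y}$, it follows that $\mathbf{z} \in \mathbf{B}(\mathbf{y}) \subset 
\mathbf{BA}(\mathbf{x}) = \mathbf{C}(\mathbf{x})$. ∎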


Two important corollaries follow immediately. 


Corollary 1. Let $\mathbf{A}: X \to Y$ and $\mathbf{B}: Y \to Z$ be point-to-set mappings. If A 
is closed at $\mathbf{x}$, B is closed on $\mathbf{A}(\mathbf{x})$ and Y is compact, then the composite map 
$\mathbf{C} = \mathbf{BA}$ is closed at $\mathbf{x}$. 


Corollary 2. Let $\mathbf{A}: X \to Y$ be a point-to-point mapping and $\mathbf{B}: Y \to Z$ a 
point-to-set mapping. If A is continuous at $\mathbf{x}$ and B is closed at $\mathbf{A}(\mathbf{x})$, then the 
composite mapping $\mathbf{C} = \mathbf{BA}$ is closed at $\mathbf{x}$. 


Global Convergence Theorem 


The Global Convergence Theorem is used to establish convergence for the following 
general situation. There is a solution set $\Gamma$. Points are generated according to 
the algorithm $\mathbf{x}_{k+1} \in \mathbf{A}(\mathbf{x}_k)$, and each new point always strictly decreases a 
descent function Z unless the solution set $\Gamma$ is reached. For example, in nonlinear 
programming, the solution set may be the set of minimum points (perhaps only 
one point), and the descent function may be the objective function itself. A suitable 
algorithm is found that generates points such that each new point strictly reduces 
the value of the objective. Then, under appropriate conditions, it follows that the 
sequence converges to the solution set. The Global Convergence Theorem establishes 
technical conditions for which convergence is guaranteed. 


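Global Convergence Theorem. Let A be an algorithm on X, and suppose that, 
given $\mathbf{x}_0$ the sequence $\{\mathbf{x}_k\}_{k=0}^{\infty}$ is generated satisfying 


$$\mathbf{x}_{k+1} \in \mathbf{A}(\mathbf{x}_k).$$


Let a solution set $\Gamma \subset X$ be given, and suppose 


i) all points $\mathbf{x}_k$ are contained in a compact set $S \subset X$ 
ii) there is a continuous function Z on X such that 


(a) if $\mathbf{x} \notin \Gamma$, then $Z(\mathbf{y}) < Z(\mathbf{x})$ for all $\mathbf{y} \in \mathbf{A}(\mathbf{x})$ 
(b) if $\mathbf{x} \in \Gamma$, then $Z(\mathbf{y}) \leq Z(\mathbf{x})$ for all $\mathbf{y} \in \mathbf{A}(\mathbf{x})$ 


iii) the mapping A is closed at points outside $\Gamma$. 
Then the limit of any convergent subsequence of $\{\mathbf{x}_k\}$ is a solution. 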


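Proof. Suppose the convergent subsequence $\{\mathbf{x}_k\}$, $k \in \mathcal{K}$, converges to the limit $\mathbf{x}$. 
Since Z is continuous, it follows that for $k \in \mathcal{K}$, $Z(\mathbf{x}_k) \to Z(\mathbf{x})$. This means that Z is 
convergent with respect to the subsequence, and we shall show that it is convergent 
with respect to the entire sequence. By the monotonicity of Z on the sequence $\{\mathbf{x}_k\}$ 
we have $Z(\mathbf{x}_k) - Z(\mathbf{x}) \geq 0$ for all k. By the convergence of Z on the subsequence, 
there is, for a given $\varepsilon > 0$, a $K \in \mathcal{K}$ such that $Z(\mathbf{x}_k) - Z(\mathbf{x}) < \varepsilon$ for all $k \geq K$, 
$k \in \mathcal{K}$. Thus for all $k > K$ 


$$Z(\mathbf{x}_k) - Z(\mathbf{x}) = Z(\mathbf{x}_k) - Z(\mathbf{x}_K) + Z(\mathbf{x}_K) - Z(\mathbf{x}) < \varepsilon,$$


since $Z(\mathbf{x}_k) - Z(\mathbf{x}_K) \leq 0$ by monotonicity. This shows that $Z(\mathbf{x}_k) \to Z(\mathbf{x})$. 

To complete the proof it is only necessary to show that $\mathbf{x}$ is a solution. Suppose 
$\mathbf{x}$ is not a solution. Consider the subsequence $\{\mathbf{x}_{k+1}\}_{\mathcal{K}}$. Since all members of this 
sequence are contained in a compact set, there is a $\bar{\mathcal{K}} \subset \mathcal{K}$ such that $\{\mathbf{x}_{k+1}\}_{\bar{\mathcal{K}}}$ 
converges to some limit $\bar{\mathbf{x}}$. We thus have $\mathbf{x}_k \to \mathbf{x}$, $k \in \bar{\mathcal{K}}$, and $\mathbf{x}_{k+1} \in \mathbf{A}(\mathbf{x}_k)$ with 
$\mathbf{x}_{k+1} \to \bar{\mathbf{x}}$, $k \in \bar{\mathcal{K}}$. Thus since A is closed at $\mathbf{x}$ it follows that $\bar{\mathbf{x}} \in \mathbf{A}(\mathbf{x})$. But from 
above, $Z(\bar{\mathbf{x}}) = Z(\mathbf{x})$ which contradicts the fact that Z is a descent function. ∎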


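Corollary. If under the conditions of the Global Convergence Theorem $\Gamma$ 
consists of a single point $\bar{\mathbf{x}}$, then the sequence $\{\mathbf{x}_k\}$ converges to $\bar{\mathbf{x}}$. 


Proof. Suppose to the contrary that there is a subsequence $\{\mathbf{x}_k\}_{\mathcal{K}}$ and an $\varepsilon > 0$ 
such that $|\mathbf{x}_k - \bar{\mathbf{x}}| > \varepsilon$ for all $k \in \mathcal{K}$. By compactness there must be $\mathcal{K}' \subset \mathcal{K}$ such that 
$\{\mathbf{x}_k\}_{\mathcal{K}'}$ converges, say to $\mathbf{x}'$. Clearly, $|\mathbf{x}' - \bar{\mathbf{x}}| \geq \varepsilon$, but by the Global Convergence 
Theorem $\mathbf{x}' \in \Gamma$, which is a contradiction. 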


In later chapters the Global Convergence Theorem is used to establish 
the convergence of several standard algorithms. Here we consider some simple 
examples designed to illustrate the roles of the various conditions of the theorem. 




Example 4. In many respects condition (iii) of the theorem, the closedness of 
A outside the solution set, is the most important condition. The failure of many 
popular algorithms can be traced to nonsatisfaction of this condition. On the real 
line consider the point-to-point algorithm 

$$\mathbf{A}(x) = \begin{cases} \tfrac{1}{2}(x - 1) + 1, & x > 1 \\ \tfrac{1}{2}x, & x \leq 1 \end{cases}$$

and the solution set $\Gamma = \{0\}$. It is easily verified that a descent function for this 
solution set and this algorithm is $Z(x) = |x|$. However, starting from $x > 1$, the 
algorithm generates a sequence converging to $x = 1$ which is not a solution. The 
difficulty is that A is not closed at $x = 1$. 
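
The behavior described in Example 4 is easy to reproduce. The sketch below assumes a starting point of 3 and uses exact rational arithmetic; that choice is deliberate, since in ordinary floating point the iterates eventually round to exactly 1 and the map then sends them to 1/2, which would hide the phenomenon.

```python
from fractions import Fraction

def A(x):
    """Point-to-point map of Example 4, evaluated in exact rational arithmetic."""
    if x > 1:
        return Fraction(1, 2) * (x - 1) + 1
    return Fraction(1, 2) * x

x = Fraction(3)                 # assumed starting point; any x > 1 behaves the same way
descends = []
for _ in range(60):
    y = A(x)
    descends.append(abs(y) < abs(x))   # Z(x) = |x| decreases strictly at every step
    x = y

gap_to_one = x - 1              # distance from the non-solution limit point 1
jump_at_one = A(Fraction(1))    # value of the map at the limit point itself
print(float(gap_to_one), jump_at_one)
```

The iterates approach 1 from above while Z decreases at every step, yet A(1) = 1/2 is not the limit of the values A(x_k). This is precisely the failure of closedness at x = 1.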


Example 5. On the real line X consider the solution set to be empty, the descent 
function $Z(x) = e^{-x}$, and the algorithm $\mathbf{A}(x) = x + 1$. All conditions of the convergence 
theorem except (i) hold. The sequence generated from any starting condition 
diverges to infinity. This is not strictly a violation of the conclusion of the theorem 
but simply an example illustrating that if no compactness assumption is introduced, 
the generated sequence may have no convergent subsequence. 


Example 6. Consider the point-to-set algorithm A defined by the graph in Fig. 7.8 
and given explicitly on X = [0, 1] by 

$$\mathbf{A}(x) = \begin{cases} [0, x), & 1 \geq x > 0 \\ 0, & x = 0, \end{cases}$$

where [0, x) denotes a half-open interval (see Appendix A). Letting $\Gamma = \{0\}$, the 
function $Z(x) = x$ serves as a descent function, because for $x \neq 0$ all points in A(x) 
are less than $x$. 


Fig. 7.8 Graph for Example 6 


The sequence defined by 

$$x_0 = 1, \qquad x_{k+1} = x_k - \frac{1}{2^{k+2}}$$

satisfies $x_{k+1} \in \mathbf{A}(x_k)$ but it can easily be seen that $x_k \to \tfrac{1}{2} \notin \Gamma$. The difficulty here, 
of course, is that the algorithm A is not closed outside the solution set. 
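
The sequence of Example 6 can be generated exactly. The sketch below assumes 50 steps and uses rational arithmetic so that the membership test and the distance to 1/2 carry no rounding error.

```python
from fractions import Fraction

x = Fraction(1)                      # x_0 = 1
in_A = []
for k in range(50):
    x_next = x - Fraction(1, 2 ** (k + 2))
    in_A.append(0 <= x_next < x)     # membership of x_{k+1} in the half-open interval [0, x_k)
    x = x_next

gap = x - Fraction(1, 2)             # remaining distance to the limit 1/2
print(all(in_A), float(x), float(gap))
```

Every iterate is a legitimate selection from A(x_k), and the descent function Z(x) = x decreases at each step, but the remaining gap to 1/2 only halves each time, so the sequence stalls at 1/2 rather than reaching the solution 0.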


7.8 SPEED OF CONVERGENCE 


The study of speed of convergence is an important but sometimes complex subject. 
Nevertheless, there is a rich and yet elementary theory of convergence rates that 
enables one to predict with confidence the relative effectiveness of a wide class 
of algorithms. In this section we introduce various concepts designed to measure 
speed of convergence, and prepare for a study of this most important aspect of 
nonlinear programming. 


Order of Convergence 


Consider a sequence of real numbers $\{r_k\}_{k=0}^{\infty}$ converging to the limit $r^*$. We define 
several notions related to the speed of convergence of such a sequence. 


Definition. Let the sequence $\{r_k\}$ converge to $r^*$. The order of convergence 
of $\{r_k\}$ is defined as the supremum of the nonnegative numbers $p$ satisfying 

$$0 \leq \limsup_{k \to \infty} \frac{|r_{k+1} - r^*|}{|r_k - r^*|^p} < \infty.$$

To ensure that the definition is applicable to any sequence, it is stated in terms 
of limit superior rather than just limit and 0/0 (which occurs if $r_k = r^*$ for all k) is 
regarded as finite. But these technicalities are rarely necessary in actual analysis, 
since the sequences generated by algorithms are generally quite well behaved. 

It should be noted that the order of convergence, as with all other notions related 
to speed of convergence that are introduced, is determined only by the properties 
of the sequence that hold as $k \to \infty$. Somewhat loosely but picturesquely, we are 
therefore led to refer to the tail of a sequence—that part of the sequence that is 
arbitrarily far out. In this language we might say that the order of convergence is 
a measure of how good the worst part of the tail is. Larger values of the order p 
imply, in a sense, faster convergence, since the distance from the limit r* is reduced, 
at least in the tail, by the $p$th power in a single step. Indeed, if the sequence has 
order $p$ and (as is the usual case) the limit 

$$\beta = \lim_{k \to \infty} \frac{|r_{k+1} - r^*|}{|r_k - r^*|^p}$$

exists, then asymptotically we have 

$$|r_{k+1} - r^*| = \beta |r_k - r^*|^p.$$


Example 1. The sequence with $r_k = a^k$ where $0 < a < 1$ converges to zero with 
order unity, since $r_{k+1}/r_k = a$. 


Example 2. The sequence with $r_k = a^{(2^k)}$ for $0 < a < 1$ converges to zero with 
order two, since $r_{k+1}/r_k^2 = 1$. 
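
The two ratios quoted in Examples 1 and 2 can be confirmed directly. The sketch assumes a = 0.5, a value chosen because its powers are represented exactly in binary floating point, so the computed ratios are exact.

```python
a = 0.5
linear = [a ** k for k in range(1, 8)]             # r_k = a^k       (Example 1)
quadratic = [a ** (2 ** k) for k in range(1, 8)]   # r_k = a^(2^k)   (Example 2)

# Order one: the ratio r_{k+1} / r_k stays at a.
ratios_linear = [linear[i + 1] / linear[i] for i in range(len(linear) - 1)]

# Order two: the ratio r_{k+1} / r_k^2 stays at 1.
ratios_quadratic = [quadratic[i + 1] / quadratic[i] ** 2 for i in range(len(quadratic) - 1)]

print(ratios_linear)
print(ratios_quadratic)
```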


Linear Convergence 


Most algorithms discussed in this book have an order of convergence equal to unity. 
It is therefore appropriate to consider this class in greater detail and distinguish 
certain cases within it. 


Definition. If the sequence $\{r_k\}$ converges to $r^*$ in such a way that 

$$\lim_{k \to \infty} \frac{|r_{k+1} - r^*|}{|r_k - r^*|} = \beta < 1,$$

the sequence is said to converge linearly to $r^*$ with convergence ratio (or 
rate) $\beta$. 


Linear convergence is, for our purposes, without doubt the most important 
type of convergence behavior. A linearly convergent sequence, with convergence 
ratio $\beta$, can be said to have a tail that converges at least as fast as the geometric 
sequence $c\beta^k$ for some constant $c$. Thus linear convergence is sometimes referred 
to as geometric convergence, although in this book we reserve that phrase for the 
case when a sequence is exactly geometric. 

As a rule, when comparing the relative effectiveness of two competing 
algorithms both of which produce linearly convergent sequences, the comparison is 
based on their corresponding convergence ratios—the smaller the ratio the faster the 
rate. The ultimate case where $\beta = 0$ is referred to as superlinear convergence. We 
note immediately that convergence of any order greater than unity is superlinear, 
but it is also possible for superlinear convergence to correspond to unity order. 


Example 3. The sequence $r_k = 1/k$ converges to zero. The convergence is of 
order one but it is not linear, since $\lim_{k \to \infty} (r_{k+1}/r_k) = 1$, that is, $\beta$ is not strictly less 
than one. 


Example 4. The sequence $r_k = (1/k)^k$ is of order unity, since $r_{k+1}/r_k^p \to \infty$ for 
$p > 1$. However, $r_{k+1}/r_k \to 0$ as $k \to \infty$ and hence this is superlinear convergence. 
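
The contrast between Examples 3 and 4 shows up clearly in the step-wise ratios. The sketch below samples them at a few indices; the indices and the trial exponent p = 1.5 are assumptions chosen so that all quantities stay within floating-point range.

```python
def harmonic(k):
    return 1.0 / k            # Example 3: r_k = 1/k

def fast(k):
    return (1.0 / k) ** k     # Example 4: r_k = (1/k)^k

def ratio(r, k, p=1.0):
    """Step-wise ratio r_{k+1} / r_k**p (the limit r* is zero for both sequences)."""
    return r(k + 1) / r(k) ** p

ks = [10, 50, 100]
harmonic_ratios = [ratio(harmonic, k) for k in ks]      # climbs toward 1: not linear
fast_ratios = [ratio(fast, k) for k in ks]              # falls toward 0: superlinear
fast_ratios_p = [ratio(fast, k, p=1.5) for k in ks]     # grows without bound: order is not above 1

print(harmonic_ratios)
print(fast_ratios)
print(fast_ratios_p)
```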


*Average Rates 


All the definitions given above can be referred to as step-wise concepts of conver- 
gence, since they define bounds on the progress made by going a single step: from 
k to k+1. Another approach is to define concepts related to the average progress 
per step over a large number of steps. We briefly illustrate how this can be done. 


Definition. Let the sequence $\{r_k\}$ converge to $r^*$. The average order of 
convergence is the infimum of the numbers $p > 1$ such that 

$$\limsup_{k \to \infty} |r_k - r^*|^{1/p^k} = 1.$$

The order is infinity if the equality holds for no $p > 1$. 


Example 5. For the sequence $r_k = a^{(2^k)}$, $0 < a < 1$, given in Example 2, we have 

$$|r_k|^{1/2^k} = a,$$

while 

$$|r_k|^{1/p^k} = a^{(2/p)^k} \to 1$$

for $p > 2$. Thus the average order is two. 


Example 6. For $r_k = a^k$ with $0 < a < 1$ we have 

$$(r_k)^{1/p^k} = a^{k(1/p)^k} \to 1$$

for any $p > 1$. Thus the average order is unity. 
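
Example 5 can be checked numerically as well. Because r_k = a^(2^k) underflows very quickly, the sketch below works with logarithms; the value a = 0.5, the sampled indices, and the helper name `average_root` are assumptions made for illustration.

```python
import math

def average_root(a, k, p):
    """|r_k|**(1/p**k) for r_k = a**(2**k), computed through logarithms to avoid underflow."""
    return math.exp((2.0 / p) ** k * math.log(a))

a = 0.5
at_p2 = [average_root(a, k, 2.0) for k in (5, 20, 40)]   # stays at a
at_p3 = [average_root(a, k, 3.0) for k in (5, 20, 40)]   # tends to 1

print(at_p2)
print(at_p3)
```

With p = 2 the quantity remains at a < 1, while with p = 3 it approaches 1. This is consistent with an average order of two.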


As before, the most important case is that of unity order, and in this case we 
define the average convergence ratio as $\limsup_{k \to \infty} |r_k - r^*|^{1/k}$. Thus for the geometric 
sequence $r_k = c a^k$, $0 < a < 1$, the average convergence ratio is $a$. Paralleling the 
earlier definitions, the reader can then in a similar manner define corresponding 
notions of average linear and average superlinear convergence. 

Although the above array of definitions can be further embellished and 
expanded, it is quite adequate for our purposes. For the most part we work with 
the step-wise definitions, since in analyzing iterative algorithms it is natural to 
compare one step with the next. In most situations, moreover, when the sequences 
are well behaved and the limits exist in the definitions, then the step-wise and 
average concepts of convergence rates coincide. 


*Convergence of Vectors 


Suppose $\{\mathbf{x}_k\}_{k=0}^{\infty}$ is a sequence of vectors in $E^n$ converging to a vector $\mathbf{x}^*$. The convergence 
properties of such a sequence are defined with respect to some particular 
function that converts the sequence of vectors into a sequence of numbers. Thus, 
if $f$ is a given continuous function on $E^n$, the convergence properties of $\{\mathbf{x}_k\}$ 
can be defined with respect to $f$ by analyzing the convergence of $f(\mathbf{x}_k)$ to $f(\mathbf{x}^*)$. 
The function f used in this way to measure convergence is called the error function. 

In optimization theory it is common to choose the error function by which to 
measure convergence as the same function that defines the objective function of the 
original optimization problem. This means we measure convergence by how fast the 
objective converges to its minimum. Alternatively, we sometimes use the function 
$|\mathbf{x} - \mathbf{x}^*|^2$ and thereby measure convergence by how fast the (squared) distance from 
the solution point decreases to zero. 

Generally, the order of convergence of a sequence is insensitive to the particular 
error function used; but for step-wise linear convergence the associated convergence 
ratio is not. Nevertheless, the average convergence ratio is not too sensitive, as the 
following proposition demonstrates, and hence the particular error function used to 
measure convergence is not really very important. 


Proposition. Let $f$ and $g$ be two error functions satisfying $f(\mathbf{x}^*) = g(\mathbf{x}^*) = 0$ 
and, for all $\mathbf{x}$, a relation of the form 


$$0 \leq a_1 g(\mathbf{x}) \leq f(\mathbf{x}) \leq a_2 g(\mathbf{x})$$


for some fixed $a_1 > 0$, $a_2 > 0$. If the sequence $\{\mathbf{x}_k\}_{k=0}^{\infty}$ converges to $\mathbf{x}^*$ linearly 
with average ratio $\beta$ with respect to one of these functions, it also does so with 
respect to the other. 


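Proof. The statement is easily seen to be symmetric in $f$ and $g$. Thus we assume 
$\{\mathbf{x}_k\}$ is linearly convergent with average convergence ratio $\beta$ with respect to $f$, and 
will prove that the same is true with respect to $g$. We have 


$$\beta = \limsup_{k \to \infty} f(\mathbf{x}_k)^{1/k} \leq \limsup_{k \to \infty} a_2^{1/k} g(\mathbf{x}_k)^{1/k} = \limsup_{k \to \infty} g(\mathbf{x}_k)^{1/k}$$

and 

$$\beta = \limsup_{k \to \infty} f(\mathbf{x}_k)^{1/k} \geq \limsup_{k \to \infty} a_1^{1/k} g(\mathbf{x}_k)^{1/k} = \limsup_{k \to \infty} g(\mathbf{x}_k)^{1/k}.$$

Thus 

$$\beta = \limsup_{k \to \infty} g(\mathbf{x}_k)^{1/k}.$$ ∎
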
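
A quick numerical illustration of the proposition, using a quadratic-form error function of the kind discussed in the next paragraph. The matrix Q = diag(1, 10), the sequence x_k = (0.5^k, 0.5^k), and the sampled indices are assumptions made for this sketch; with these choices a_1 = 1 and a_2 = 10.

```python
def g(x):
    return x[0] ** 2 + x[1] ** 2                 # |x - x*|^2 with x* = 0

def f(x):
    return 1.0 * x[0] ** 2 + 10.0 * x[1] ** 2    # x^T Q x with assumed Q = diag(1, 10)

def x_k(k):
    return (0.5 ** k, 0.5 ** k)                  # assumed sequence converging to x* = 0

for k in (10, 100, 400):
    print(k, f(x_k(k)) ** (1.0 / k), g(x_k(k)) ** (1.0 / k))

root_f = f(x_k(400)) ** (1.0 / 400)
root_g = g(x_k(400)) ** (1.0 / 400)
```

Both k-th roots drift toward the same value, 0.25, even though f and g differ by a factor between 1 and 10 at every point. The constants a_1 and a_2 are washed out by the 1/k power.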
As an example of an application of the above proposition, consider the case 
where $g(\mathbf{x}) = |\mathbf{x} - \mathbf{x}^*|^2$ and $f(\mathbf{x}) = (\mathbf{x} - \mathbf{x}^*)^T \mathbf{Q} (\mathbf{x} - \mathbf{x}^*)$, where Q is a positive definite 
symmetric matrix. Then $a_1$ and $a_2$ correspond, respectively, to the smallest and 
largest eigenvalues of Q. Thus average linear convergence is identical with respect 
to any error function constructed from a positive definite quadratic form. 


Complexity 


Complexity theory as outlined in Section 5.1 is an important aspect of convergence 
theory. This theory can be used in conjunction with the theory of local convergence. 
If an algorithm converges according to any order greater than zero, then for a 
fixed problem, the sequence generated by the algorithm will converge in a time 
that is a function of the convergence order (and rate, if convergence is linear). For 
example, if the order is one with rate $0 < c < 1$ and the process begins with an 
error of $R$, a final error of $r$ can be achieved by a number of steps $n$ satisfying 
$c^n R \leq r$. Thus it requires approximately $n = \log(R/r)/\log(1/c)$ steps. In this form 
the number of steps is not affected by the size of the problem. However, problem 
size enters in two possible ways. First, the rate $c$ may depend on the size, say 
going toward 1 as the size increases so that the speed is slower for large problems. 
The second way that size may enter, and this is the more important way, is that 
the time to execute a single step almost always increases with problem size. For 
instance if, for a problem seeking an optimal vector of dimension $m$, each step 
requires a Gaussian elimination inversion of an $m \times m$ matrix, the solution time 
will increase by a factor proportional to $m^3$. Overall the algorithm is therefore a 
polynomial time algorithm. Essentially all algorithms in this book employ steps, 
such as matrix multiplications or inversion or other algebraic operations, which are 
polynomial-time in character. Convergence analysis, therefore, focuses on whether 
an algorithm is globally convergent, on its local convergence properties, and also 
on the order of the algebraic operations required to execute the steps required. The 
last of these is usually easily deduced by listing the number and size of the required 
vector and matrix operations. 
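
The step-count estimate for a linearly convergent process is simple to evaluate. The helper below is a minimal sketch; the function name `steps_needed` and the sample values of R, r, and c are assumptions made for illustration.

```python
import math

def steps_needed(R, r, c):
    """Smallest integer n with c**n * R <= r, for a linear rate 0 < c < 1."""
    return max(0, math.ceil(math.log(R / r) / math.log(1.0 / c)))

# Reducing an initial error of 1 to 1e-6 at two different rates.
n_fast = steps_needed(1.0, 1e-6, 0.5)
n_slow = steps_needed(1.0, 1e-6, 0.9)
print(n_fast, n_slow)
```

A rate closer to 1 raises the step count sharply (here from 20 to 132), which is the first of the two ways problem size can enter. The cost of each individual step, the second way, is not modeled in this sketch.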


7.9 SUMMARY 


There are two different but complementary ways to characterize the solution to 
unconstrained optimization problems. In the local approach, one examines the 
relation of a given point to its neighbors. This leads to the conclusion that, at an 
unconstrained relative minimum point of a smooth function, the gradient of the 
function vanishes and the Hessian is positive semidefinite; and conversely, if at 
a point the gradient vanishes and the Hessian is positive definite, that point is a 
relative minimum point. This characterization has a natural extension to the global 
approach where convexity ensures that if the gradient vanishes at a point, that point 
is a global minimum point. 

In considering iterative algorithms for finding either local or global minimum 
points, there are two distinct issues: global convergence properties and local conver- 
gence properties. The first is concerned with whether starting at an arbitrary point 
the sequence generated will converge to a solution. This is ensured if the algorithm 
is closed, has a descent function, and generates a bounded sequence. Local conver- 
gence properties are a measure of the ultimate speed of convergence and generally 
determine the relative advantage of one algorithm to another. 


7.10 EXERCISES 


1. To approximate a function $g$ over the interval [0, 1] by a polynomial $p$ of degree $n$ (or 
less), we minimize the criterion 


$$f(\mathbf{a}) = \int_0^1 [g(x) - p(x)]^2 \, dx,$$


where $p(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_0$. Find the equations satisfied by the optimal 
coefficients $\mathbf{a} = (a_0, a_1, \ldots, a_n)$. 


2. In Example 3 of Section 7.2 show that if the solution has x, > 0, x, +x, = 1, then it is 
necessary that 


b, — by + (c, — y)A(x,) = 0 
by + (Cy — €3) A(X +2) <0. 


Hint: One way is to reformulate the problem in terms of the variables x, and y= x, +X). 


3. a) Using the first-order necessary conditions, find a minimum point of the function 

$$f(x, y, z) = 2x^2 + xy + y^2 + yz + z^2 - 6x - 7y - 8z + 9.$$


b) Verify that the point is a relative minimum point by verifying that the second-order 
sufficiency conditions hold. 
c) Prove that the point is a global minimum point. 


4. In this exercise and the next we develop a method for determining whether a given 
symmetric matrix is positive definite. Given an $n \times n$ matrix A let $\mathbf{A}_k$ denote the principal 
submatrix made up of the first k rows and columns. Show (by induction) that if the 
first n — 1 principal submatrices are nonsingular, then there is a unique lower triangular 
matrix L with unit diagonal and a unique upper triangular matrix U such that A = LU. 
(See Appendix C.) 


5. A symmetric matrix is positive definite if and only if the determinant of each of its 
principal submatrices is positive. Using this fact and the considerations of Exercise 4, 
show that an n x n symmetric matrix A is positive definite if and only if it has an LU 
decomposition (without interchange of rows) and the diagonal elements of U are all 
positive. 


6. Using Exercise 5 show that an $n \times n$ matrix A is symmetric and positive definite if and 
only if it can be written as $\mathbf{A} = \mathbf{G}\mathbf{G}^T$ where G is a lower triangular matrix with positive 
diagonal elements. This representation is known as the Cholesky factorization of A. 


7. Let $f_i$, $i \in I$ be a collection of convex functions defined on a convex set $\Omega$. Show that 
the function $f$ defined by $f(\mathbf{x}) = \sup_{i \in I} f_i(\mathbf{x})$ is convex on the region 
where it is finite. 


8. Let $\gamma$ be a monotone nondecreasing function of a single variable (that is, $\gamma(r) \leq \gamma(r')$ 
for $r' > r$) which is also convex; and let $f$ be a convex function defined on a convex 
set $\Omega$. Show that the function $\gamma(f)$ defined by $\gamma(f)(\mathbf{x}) = \gamma[f(\mathbf{x})]$ is convex on $\Omega$. 


9. Let $f$ be twice continuously differentiable on a region $\Omega \subset E^n$. Show that a sufficient 
condition for a point $\mathbf{x}^*$ in the interior of $\Omega$ to be a relative minimum point of $f$ is that 
$\nabla f(\mathbf{x}^*) = \mathbf{0}$ and that $f$ be locally convex at $\mathbf{x}^*$. 


10. Define the point-to-set mapping on $E^n$ by 

$$\mathbf{A}(\mathbf{x}) = \{\mathbf{y} : \mathbf{y}^T \mathbf{x} \leq b\},$$


where $b$ is a fixed constant. Is A closed? 
11. Prove the two corollaries in Section 7.6 on the closedness of composite mappings. 


12. Show that if A is a continuous point-to-point mapping, the Global Convergence Theorem 
is valid even without assumption (i). Compare with Example 2, Section 7.7. 


13. Let $\{r_k\}_{k=0}^{\infty}$ and $\{c_k\}_{k=0}^{\infty}$ be sequences of real numbers. Suppose $r_k \to 0$ average linearly 
and that there are constants c > 0 and C such that $c \le c_k \le C$ for all k. Show that 
$c_k r_k \to 0$ average linearly. 


14. Prove a proposition, similar to the one in Section 7.8, showing that the order of conver- 
gence is insensitive to the error function. 


15. Show that if $r_k \to r^*$ (step-wise) linearly with convergence ratio $\beta$, then $r_k \to r^*$ (average) 
linearly with average convergence ratio no greater than $\beta$. 


Chapter 8 BASIC DESCENT METHODS 


We turn now to a description of the basic techniques used for iteratively solving 
unconstrained minimization problems. These techniques are, of course, important 
for practical application since they often offer the simplest, most direct alterna- 
tives for obtaining solutions; but perhaps their greatest importance is that they 
establish certain reference plateaus with respect to difficulty of implementation 
and speed of convergence. Thus in later chapters as more efficient techniques and 
techniques capable of handling constraints are developed, reference is continually 
made to the basic techniques of this chapter both for guidance and as points of 
comparison. 

There is a fundamental underlying structure for almost all the descent 
algorithms we discuss. One starts at an initial point; determines, according to 
a fixed rule, a direction of movement; and then moves in that direction to a 
(relative) minimum of the objective function on that line. At the new point a 
new direction is determined and the process is repeated. The primary differences 
between algorithms (steepest descent, Newton’s method, etc.) rest with the rule 
by which successive directions of movement are selected. Once the selection is 
made, all algorithms call for movement to the minimum point on the corresponding 
line. 

The process of determining the minimum point on a given line is called 
line search. For general nonlinear functions that cannot be minimized analyti- 
cally, this process actually is accomplished by searching, in an intelligent manner, 
along the line for the minimum point. These line search techniques, which are 
really procedures for solving one-dimensional minimization problems, form the 
backbone of nonlinear programming algorithms, since higher dimensional problems 
are ultimately solved by executing a sequence of successive line searches. There 
are a number of different approaches to this important phase of minimization and 
the first half of this chapter is devoted to their discussion. 

The last sections of the chapter are devoted to a description and analysis of the 
basic descent algorithms for unconstrained problems: steepest descent, Newton’s 
method, and coordinate descent. These algorithms serve as primary models for the 
development and analysis of all others discussed in the book. 


8.1 FIBONACCI AND GOLDEN SECTION SEARCH 


A very popular method for resolving the line search problem is the Fibonacci search 
method described in this section. The method has a certain degree of theoretical 
elegance, which no doubt partially accounts for its popularity, but on the whole, as 
we shall see, there are other procedures which in most circumstances are superior. 

The method determines the minimum value of a function f over a closed 
interval $[c_1, c_2]$. In applications, f may in fact be defined over a broader domain, 
but for this method a fixed interval of search must be specified. The only property 
that is assumed of f is that it is unimodal, that is, it has a single relative minimum 
(see Fig. 8.1). The minimum point of f is to be determined, at least approximately, 
by measuring the value of f at a certain number of points. It should be imagined, as 
is indeed the case in the setting of nonlinear programming, that each measurement 
of f is somewhat costly—of time if nothing more. 

To develop an appropriate search strategy, that is, a strategy for selecting 
measurement points based on the previously obtained values, we pose the following 
problem: Find how to successively select N measurement points so that, without 
explicit knowledge of f, we can determine the smallest possible region of uncer- 
tainty in which the minimum must lie. In this problem the region of uncertainty is 
determined in any particular case by the relative values of the measured points in 
conjunction with our assumption that f is unimodal. Thus, after values are known 
at N points $x_1, x_2, \ldots, x_N$ with 

$$c_1 \le x_1 < x_2 < \cdots < x_{N-1} < x_N \le c_2,$$

the region of uncertainty is the interval $[x_{k-1}, x_{k+1}]$ where $x_k$ is the minimum point 
among the N, and we define $x_0 = c_1$, $x_{N+1} = c_2$ for consistency. The minimum of 
f must lie somewhere in this interval. 


Fig. 8.1 A unimodal function 

The derivation of the optimal strategy for successively selecting measurement 
points to obtain the smallest region of uncertainty is fairly straightforward but 
somewhat tedious. We simply state the result and give an example. 

Let 

$$d_1 = c_2 - c_1, \quad \text{the initial width of uncertainty},$$

$$d_k = \text{width of uncertainty after } k \text{ measurements}.$$

Then, if a total of N measurements are to be made, we have 

$$d_k = \left(\frac{F_{N-k+1}}{F_N}\right) d_1, \tag{1}$$

where the integers $F_k$ are members of the Fibonacci sequence generated by the 
recurrence relation 

$$F_N = F_{N-1} + F_{N-2}, \qquad F_0 = F_1 = 1. \tag{2}$$

The resulting sequence is 1, 1, 2, 3, 5, 8, 13, .... 

The procedure for reducing the width of uncertainty to $d_N$ is this: The first two 
measurements are made symmetrically at a distance of $(F_{N-1}/F_N)d_1$ from the ends 
of the initial interval; according to which of these is of lesser value, an uncertainty 
interval of width $d_2 = (F_{N-1}/F_N)d_1$ is determined. The third measurement point 
is placed symmetrically in this new interval of uncertainty with respect to the 
measurement already in the interval. The result of this third measurement gives an 
interval of uncertainty $d_3 = (F_{N-2}/F_N)d_1$. In general, each successive measurement 
point is placed in the current interval of uncertainty symmetrically with the point 
already existing in that interval. 

Some examples are shown in Fig. 8.2. In these examples the sequence of 
measurement points is determined in accordance with the assumption that each 
measurement is of lower value than its predecessors. Note that the procedure always 
calls for the last two measurements to be made at the midpoint of the semifinal 
interval of uncertainty. We are to imagine that these two points are actually separated 
a small distance so that a comparison of their respective values will reduce the 
interval to nearly half. This terminal anomaly of the Fibonacci search process is, 
of course, of no great practical consequence. 
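
The procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the function name `fibonacci_search`, the argument names, and the small separation `delta` used for the final pair of measurements are assumptions introduced here, and the sketch assumes f is unimodal on the interval and that at least three measurements are allowed.

```python
def fibonacci_search(f, c1, c2, n_meas, delta=1e-9):
    """Shrink [c1, c2] to width about (c2 - c1) / F_N using N = n_meas measurements."""
    if n_meas < 3:
        raise ValueError("this sketch assumes at least three measurements")
    # Fibonacci numbers with the convention F_0 = F_1 = 1 used in Eq. (2).
    fib = [1, 1]
    while len(fib) <= n_meas:
        fib.append(fib[-1] + fib[-2])

    a, b = c1, c2
    offset = fib[n_meas - 1] / fib[n_meas] * (b - a)
    # First two measurements, each at distance (F_{N-1}/F_N) d_1 from one end.
    x_lo, x_hi = b - offset, a + offset
    f_lo, f_hi = f(x_lo), f(x_hi)

    for m in range(3, n_meas + 1):
        # Discard the part of the interval that cannot contain the minimum.
        if f_lo <= f_hi:
            b, x_in, f_in = x_hi, x_lo, f_lo
        else:
            a, x_in, f_in = x_lo, x_hi, f_hi
        # New point: mirror image of the point already inside the interval.
        x_new = a + b - x_in
        if m == n_meas:
            # Terminal anomaly: the mirror image coincides with x_in,
            # so the two points are separated by a small distance delta.
            x_new = x_in + delta
        f_new = f(x_new)
        if x_new < x_in:
            x_lo, f_lo, x_hi, f_hi = x_new, f_new, x_in, f_in
        else:
            x_lo, f_lo, x_hi, f_hi = x_in, f_in, x_new, f_new

    # The last comparison fixes the final interval of uncertainty.
    if f_lo <= f_hi:
        b = x_hi
    else:
        a = x_lo
    return a, b
```

For example, with $f(x) = (x - 0.3)^2$ on [0, 1] and N = 5, the final width is close to $d_1/F_5 = 1/8$, in agreement with Eq. (1).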


Search by Golden Section 


If the number N of allowed measurement points in a Fibonacci search is made to 
approach infinity, we obtain the golden section method. It can be argued, based on 
the optimal property of the finite Fibonacci method, that the corresponding infinite 
version yields a sequence of intervals of uncertainty whose widths tend to zero 
faster than that which would be obtained by other methods. 

Fig. 8.2 Fibonacci search 


The solution to the Fibonacci difference equation 

$$F_N = F_{N-1} + F_{N-2} \tag{3}$$

is of the form 

$$F_N = A\tau_1^N + B\tau_2^N, \tag{4}$$

where $\tau_1$ and $\tau_2$ are roots of the characteristic equation 

$$\tau^2 = \tau + 1.$$

Explicitly, 

$$\tau_1 = \frac{1 + \sqrt{5}}{2}, \qquad \tau_2 = \frac{1 - \sqrt{5}}{2}.$$

(The number $\tau_1 \approx 1.618$ is known as the golden section ratio and was considered 
by early Greeks to be the most aesthetic value for the ratio of two adjacent sides 
of a rectangle.) For large N the first term on the right side of (4) dominates the 
second, and hence 

$$\lim_{N \to \infty} \frac{F_{N-1}}{F_N} = \frac{1}{\tau_1} \approx 0.618.$$

It follows from (1) that the interval of uncertainty at any point in the process has 
width 

$$d_k = \left(\frac{1}{\tau_1}\right)^{k-1} d_1, \tag{5}$$

and from this it follows that 

$$\frac{d_{k+1}}{d_k} = \frac{1}{\tau_1} \approx 0.618. \tag{6}$$

Therefore, we conclude that, with respect to the width of the uncertainty interval, 
the search by golden section converges linearly (see Section 7.8) to the overall 
minimum of the function f with convergence ratio $1/\tau_1 \approx 0.618$. 
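
A minimal Python sketch of golden section search follows. The function name, the tolerance argument `tol`, and the returned list of interval widths are assumptions made for illustration; the sketch assumes f is unimodal on the starting interval. The recorded widths make the linear convergence ratio of about 0.618 directly visible.

```python
import math

def golden_section_search(f, c1, c2, tol=1e-6):
    """Return an approximate minimizer of a unimodal f on [c1, c2] and the interval widths."""
    inv_tau = 2.0 / (1.0 + math.sqrt(5.0))  # 1 / tau_1, approximately 0.618
    a, b = c1, c2
    # Two interior points placed symmetrically in [a, b].
    x_lo = b - inv_tau * (b - a)
    x_hi = a + inv_tau * (b - a)
    f_lo, f_hi = f(x_lo), f(x_hi)
    widths = [b - a]
    while b - a > tol:
        if f_lo <= f_hi:
            # Minimum lies in [a, x_hi]; the old x_lo becomes the new upper interior point.
            b, x_hi, f_hi = x_hi, x_lo, f_lo
            x_lo = b - inv_tau * (b - a)
            f_lo = f(x_lo)
        else:
            # Minimum lies in [x_lo, b]; the old x_hi becomes the new lower interior point.
            a, x_lo, f_lo = x_lo, x_hi, f_hi
            x_hi = a + inv_tau * (b - a)
            f_hi = f(x_hi)
        widths.append(b - a)
    return 0.5 * (a + b), widths
```

Only one new function evaluation is needed per iteration, because the surviving interior point is already in the correct golden section position of the reduced interval.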


8.2 LINE SEARCH BY CURVE FITTING 


The Fibonacci search method has a certain amount of theoretical appeal, since 
it assumes only that the function being searched is unimodal and with respect 
to this broad class of functions the method is, in some sense, optimal. In most 
problems, however, it can be safely assumed that the function being searched, as 
well as being unimodal, possesses a certain degree of smoothness, and one might, 
therefore, expect that more efficient search techniques exploiting this smoothness 
can be devised; and indeed they can. Techniques of this nature are usually based 
on curve fitting procedures where a smooth curve is passed through the previously 
measured points in order to determine an estimate of the minimum point. A variety 
of such techniques can be devised depending on whether or not derivatives of the 
function as well as the values can be measured, how many previous points are 
used to determine the fit, and the criterion used to determine the fit. In this section 
a number of possibilities are outlined and analyzed. All of them have orders of 
convergence greater than unity. 


Newton’s Method 


Suppose that the function f of a single variable x is to be minimized, and suppose 
that at a point $x_k$ where a measurement is made it is possible to evaluate the three 
numbers $f(x_k)$, $f'(x_k)$, $f''(x_k)$. It is then possible to construct a quadratic function 
q which at $x_k$ agrees with f up to second derivatives, that is 

$$q(x) = f(x_k) + f'(x_k)(x - x_k) + \tfrac{1}{2} f''(x_k)(x - x_k)^2. \tag{7}$$

We may then calculate an estimate $x_{k+1}$ of the minimum point of f by finding the 
point where the derivative of q vanishes. Thus setting 

$$0 = q'(x_{k+1}) = f'(x_k) + f''(x_k)(x_{k+1} - x_k),$$

we find 

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}. \tag{8}$$

This process, which is illustrated in Fig. 8.3, can then be repeated at $x_{k+1}$. 

Fig. 8.3 Newton’s method for minimization 

We note immediately that the new point $x_{k+1}$ resulting from Newton’s method 
does not depend on the value $f(x_k)$. The method can more simply be viewed as a 
technique for iteratively solving equations of the form 

$$g(x) = 0,$$


where, when applied to minimization, we put $g(x) = f'(x)$. In this notation Newton’s 
method takes the form 

$$x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}. \tag{9}$$

This form is illustrated in Fig. 8.4. 

We now show that Newton’s method has order two convergence: 

Proposition. Let the function g have a continuous second derivative, and let 
$x^*$ satisfy $g(x^*) = 0$, $g'(x^*) \ne 0$. Then, provided $x_0$ is sufficiently close to $x^*$, 
the sequence $\{x_k\}_{k=0}^{\infty}$ generated by Newton’s method (9) converges to $x^*$ with 
an order of convergence at least two. 


Proof. For points $\xi$ in a region near $x^*$ there is a $k_1$ such that $|g''(\xi)| < k_1$ and a 
$k_2$ such that $|g'(\xi)| > k_2$. Then since $g(x^*) = 0$ we can write 

$$x_{k+1} - x^* = x_k - x^* - \frac{g(x_k) - g(x^*)}{g'(x_k)} = -\left[g(x_k) - g(x^*) + g'(x_k)(x^* - x_k)\right] / g'(x_k).$$

Fig. 8.4 Newton’s method for solving equations 

The term in brackets is, by Taylor’s theorem, zero to first-order. In fact, using the 
remainder term in a Taylor series expansion about $x_k$, we obtain 

$$x_{k+1} - x^* = \frac{1}{2} \frac{g''(\xi)}{g'(x_k)} (x_k - x^*)^2$$

for some $\xi$ between $x^*$ and $x_k$. Thus in the region near $x^*$, 

$$|x_{k+1} - x^*| \le \frac{k_1}{2k_2} |x_k - x^*|^2.$$

We see that if $|x_k - x^*| k_1/2k_2 < 1$, then $|x_{k+1} - x^*| < |x_k - x^*|$ and thus we conclude 
that if started close enough to the solution, the method will converge to $x^*$ with an 
order of convergence at least two. ∎ 
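
The iteration (8) is short enough to state directly in code. The following Python sketch is illustrative only: the function name, the stopping rule based on the step size, and the test function $f(x) = e^x - 2x$ (with minimum at $x^* = \ln 2$) are assumptions chosen here, not part of the text. It assumes the starting point is close enough to the solution and that $f''$ does not vanish along the way.

```python
import math

def newton_minimize_1d(df, d2f, x0, tol=1e-12, max_iter=50):
    """Iterate x_{k+1} = x_k - f'(x_k) / f''(x_k); return the final point and all iterates."""
    x = x0
    history = [x]
    for _ in range(max_iter):
        step = df(x) / d2f(x)   # note that f(x_k) itself is never used
        x = x - step
        history.append(x)
        if abs(step) < tol:
            break
    return x, history

# Hypothetical example: f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and f''(x) = exp(x).
x_star, iterates = newton_minimize_1d(lambda x: math.exp(x) - 2.0,
                                      lambda x: math.exp(x), x0=1.0)
errors = [abs(x - math.log(2.0)) for x in iterates]
```

In this example each error is bounded by the square of the preceding one, consistent with an order of convergence of at least two.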


Method of False Position 


Newton’s method for minimization is based on fitting a quadratic on the basis of 
information at a single point; by using more points, less information is required at 
each of them. Thus, using $f(x_k)$, $f'(x_k)$, $f'(x_{k-1})$ it is possible to fit the quadratic 

$$q(x) = f(x_k) + f'(x_k)(x - x_k) + \frac{f'(x_{k-1}) - f'(x_k)}{x_{k-1} - x_k} \cdot \frac{(x - x_k)^2}{2},$$

which has the same corresponding values. An estimate $x_{k+1}$ can then be determined 
by finding the point where the derivative of q vanishes; thus 

$$x_{k+1} = x_k - f'(x_k) \left[\frac{x_{k-1} - x_k}{f'(x_{k-1}) - f'(x_k)}\right]. \tag{10}$$


(See Fig. 8.5.) Comparing this formula with Newton’s method, we see again that 
the value $f(x_k)$ does not enter; hence, our fit could have been passed through 
either $f(x_k)$ or $f(x_{k-1})$. Also the formula can be regarded as an approximation to 
Newton’s method where the second derivative is replaced by the difference of two 
first derivatives. 

Fig. 8.5 False position for minimization 

Again, since this method does not depend on values of f directly, it can be 
regarded as a method for solving f’(x) = g(x) = 0. Viewed in this way the method, 
which is illustrated in Fig. 8.6, takes the form 


$$x_{k+1} = x_k - g(x_k) \left[\frac{x_k - x_{k-1}}{g(x_k) - g(x_{k-1})}\right]. \tag{11}$$


We next investigate the order of convergence of the method of false position 
and discover that it is order $\tau_1 \approx 1.618$, the golden mean. 


Proposition. Let g have a continuous second derivative and suppose $x^*$ is 
such that $g(x^*) = 0$, $g'(x^*) \ne 0$. Then for $x_0$ sufficiently close to $x^*$, the sequence 
$\{x_k\}_{k=0}^{\infty}$ generated by the method of false position (11) converges to $x^*$ with 
order $\tau_1 \approx 1.618$. 


Fig. 8.6 False position for solving equations 

Proof. Introducing the notation 

$$g[a, b] = \frac{g(b) - g(a)}{b - a} \tag{12}$$

we have 

$$x_{k+1} - x^* = x_k - x^* - g(x_k) \left[\frac{x_k - x_{k-1}}{g(x_k) - g(x_{k-1})}\right] = (x_k - x^*) \left\{\frac{g[x_{k-1}, x_k] - g[x_k, x^*]}{g[x_{k-1}, x_k]}\right\}. \tag{13}$$

Further, upon the introduction of the notation 

$$g[a, b, c] = \frac{g[a, b] - g[b, c]}{a - c}$$

we may write (13) as 

$$x_{k+1} - x^* = (x_k - x^*)(x_{k-1} - x^*) \left\{\frac{g[x_{k-1}, x_k, x^*]}{g[x_{k-1}, x_k]}\right\}.$$

Now, by the mean value theorem with remainder, we have (see Exercise 2) 

$$g[x_{k-1}, x_k] = g'(\xi_k) \tag{14}$$

and 

$$g[x_{k-1}, x_k, x^*] = \tfrac{1}{2} g''(\eta_k), \tag{15}$$

where $\xi_k$ and $\eta_k$ are convex combinations of $x_k$, $x_{k-1}$ and $x_k$, $x_{k-1}$, $x^*$, respectively. 
Thus 

$$x_{k+1} - x^* = \frac{g''(\eta_k)}{2g'(\xi_k)} (x_k - x^*)(x_{k-1} - x^*). \tag{16}$$

It follows immediately that the process converges if it is started sufficiently close 
to $x^*$. 

To determine the order of convergence, we note that for large k Eq. (16) 
becomes approximately 

$$x_{k+1} - x^* = M(x_k - x^*)(x_{k-1} - x^*),$$

where 

$$M = \frac{g''(x^*)}{2g'(x^*)}.$$
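
Thus defining $\varepsilon_k = (x_k - x^*)$ we have, in the limit, 

$$\varepsilon_{k+1} = M\varepsilon_k\varepsilon_{k-1}. \tag{17}$$

Taking the logarithm of this equation we have, with $y_k = \log M\varepsilon_k$, 

$$y_{k+1} = y_k + y_{k-1}, \tag{18}$$

which is the Fibonacci difference equation discussed in Section 8.1. A solution to 
this equation will satisfy 

$$y_{k+1} - \tau_1 y_k \to 0.$$

Thus 

$$\log M\varepsilon_{k+1} - \tau_1 \log M\varepsilon_k \to 0 \quad \text{or} \quad \log \frac{M\varepsilon_{k+1}}{(M\varepsilon_k)^{\tau_1}} \to 0,$$

and hence 

$$\frac{\varepsilon_{k+1}}{\varepsilon_k^{\tau_1}} \to M^{(\tau_1 - 1)}. \ ∎$$
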

Having derived the error formula (17) by direct analysis, it is now appropriate 
to point out a short-cut technique, based on symmetry and other considerations, 
that can sometimes be used in even more complicated situations. The right side of 
error formula (17) must be a polynomial in $\varepsilon_k$ and $\varepsilon_{k-1}$, since it is derived from 
approximations based on Taylor’s theorem. Furthermore, it must be second order, 
since the method reduces to Newton’s method when $x_k = x_{k-1}$. Also, it must go 
to zero if either $\varepsilon_k$ or $\varepsilon_{k-1}$ goes to zero, since the method clearly yields $\varepsilon_{k+1} = 0$ in 
that case. Finally, it must be symmetric in $\varepsilon_k$ and $\varepsilon_{k-1}$, since the order of points is 
irrelevant. The only formula satisfying these requirements is $\varepsilon_{k+1} = M\varepsilon_k\varepsilon_{k-1}$. 
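
As an illustration of iteration (10), the following Python sketch applies the method of false position to the derivative of a function. The function name, the stopping rule, and the example derivative $f'(x) = e^x - 2$ are assumptions introduced here; the sketch assumes the two starting points are close enough to the solution for the local analysis above to apply.

```python
import math

def false_position_minimize(df, x_prev, x_curr, tol=1e-12, max_iter=100):
    """Iterate Eq. (10) using only first derivatives at the two most recent points."""
    g_prev, g_curr = df(x_prev), df(x_curr)
    history = [x_prev, x_curr]
    for _ in range(max_iter):
        denom = g_curr - g_prev
        if denom == 0.0:
            break  # derivative values coincide; no further progress is possible
        x_next = x_curr - g_curr * (x_curr - x_prev) / denom
        x_prev, g_prev = x_curr, g_curr
        x_curr, g_curr = x_next, df(x_next)
        history.append(x_curr)
        if abs(x_curr - x_prev) < tol:
            break
    return x_curr, history

# Hypothetical example: f(x) = exp(x) - 2x has f'(x) = exp(x) - 2 and minimum at ln 2.
x_fp, fp_history = false_position_minimize(lambda x: math.exp(x) - 2.0, 1.0, 0.9)
```

Each iteration costs one derivative evaluation, against one first and one second derivative for Newton’s method, which is the trade that the order of about 1.618 pays for.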


Cubic Fit 


Given the points $x_{k-1}$ and $x_k$ together with the values $f(x_{k-1})$, $f'(x_{k-1})$, $f(x_k)$, 
$f'(x_k)$, it is possible to fit a cubic equation to the points having corresponding 
values. The next point $x_{k+1}$ can then be determined as the relative minimum point 
of this cubic. This leads to 

$$x_{k+1} = x_k - (x_k - x_{k-1}) \left[\frac{f'(x_k) + u_2 - u_1}{f'(x_k) - f'(x_{k-1}) + 2u_2}\right], \tag{19}$$

where 

$$u_1 = f'(x_{k-1}) + f'(x_k) - 3\,\frac{f(x_{k-1}) - f(x_k)}{x_{k-1} - x_k},$$

$$u_2 = \left[u_1^2 - f'(x_{k-1}) f'(x_k)\right]^{1/2},$$

which is easily implementable for computations. 

It can be shown (see Exercise 3) that the order of convergence of the cubic fit 
method is 2.0. Thus, although the method is exact for cubic functions indicating 
that its order might be three, its order is actually only two. 
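
A single step of the cubic fit formula (19) might be coded as below. This is a sketch under stated assumptions: the function name and argument order are invented for illustration, and the code assumes $x_{k-1} < x_k$, since the positive square root in $u_2$ selects the relative minimum of the cubic for that ordering.

```python
import math

def cubic_fit_step(x_prev, f_prev, df_prev, x_curr, f_curr, df_curr):
    """One step of Eq. (19), assuming x_prev < x_curr."""
    if not x_prev < x_curr:
        raise ValueError("this sketch assumes x_prev < x_curr")
    u1 = df_prev + df_curr - 3.0 * (f_prev - f_curr) / (x_prev - x_curr)
    radicand = u1 * u1 - df_prev * df_curr
    if radicand < 0.0:
        raise ValueError("the fitted cubic has no relative minimum")
    u2 = math.sqrt(radicand)
    return x_curr - (x_curr - x_prev) * (df_curr + u2 - u1) / (df_curr - df_prev + 2.0 * u2)

# The fit is exact for cubics: f(x) = x^3 - 3x has its relative minimum at x = 1.
f = lambda x: x ** 3 - 3.0 * x
df = lambda x: 3.0 * x ** 2 - 3.0
x_next = cubic_fit_step(0.0, f(0.0), df(0.0), 2.0, f(2.0), df(2.0))
```

With the points 0 and 2 one finds $u_1 = 3$ and $u_2 = 6$, so the step lands on the minimum point $x = 1$ of the cubic.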


Quadratic Fit 


The scheme that is often most useful in line searching is that of fitting a quadratic 
through three given points. This has the advantage of not requiring any derivative 
information. Given $x_1$, $x_2$, $x_3$ and corresponding values $f(x_1) = f_1$, $f(x_2) = f_2$, 
$f(x_3) = f_3$ we construct the quadratic passing through these points 

$$q(x) = \sum_{i=1}^{3} f_i \frac{\prod_{j \ne i}(x - x_j)}{\prod_{j \ne i}(x_i - x_j)}, \tag{20}$$

and determine a new point $x_4$ as the point where the derivative of q vanishes. Thus 

$$x_4 = \frac{1}{2}\,\frac{b_{23} f_1 + b_{31} f_2 + b_{12} f_3}{a_{23} f_1 + a_{31} f_2 + a_{12} f_3}, \tag{21}$$

where $a_{ij} = x_i - x_j$, $b_{ij} = x_i^2 - x_j^2$. 


Define the errors $\varepsilon_i = x^* - x_i$, i = 1, 2, 3, 4. The expression for $\varepsilon_4$ must be 
a polynomial in $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$. It must be second order (since it is a quadratic fit). 
It must go to zero if any two of the errors $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$ are zero. (The reader should 
check this.) Finally, it must be symmetric (since the order of points is irrelevant). It 
follows that near a minimum point $x^*$ of f, the errors are related approximately by 

$$\varepsilon_4 = M(\varepsilon_1\varepsilon_2 + \varepsilon_2\varepsilon_3 + \varepsilon_1\varepsilon_3), \tag{22}$$

where M depends on the values of the second and third derivatives of f at $x^*$. 

If we assume that $\varepsilon_k \to 0$ with an order greater than unity, then for large k the 
error is governed approximately by 

$$\varepsilon_{k+2} = M\varepsilon_k\varepsilon_{k-1}.$$

Letting $y_k = \log M\varepsilon_k$ this becomes 

$$y_{k+2} = y_k + y_{k-1}$$

with characteristic equation 

$$\lambda^3 - \lambda - 1 = 0.$$

The largest root of this equation is $\lambda \approx 1.3$ which thus determines the rate of growth 
of $y_k$ and is the order of convergence of the quadratic fit method. 
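
The quadratic fit formula (21) and the root of the characteristic equation can both be checked numerically. The Python sketch below is illustrative: the function names are assumptions, and the bisection routine is just one simple way to locate the real root of $\lambda^3 - \lambda - 1 = 0$ between 1 and 2.

```python
def quadratic_fit_step(x1, f1, x2, f2, x3, f3):
    """New point x_4 from Eq. (21), using function values only."""
    a23, a31, a12 = x2 - x3, x3 - x1, x1 - x2
    b23, b31, b12 = x2**2 - x3**2, x3**2 - x1**2, x1**2 - x2**2
    denom = a23 * f1 + a31 * f2 + a12 * f3
    if denom == 0.0:
        raise ZeroDivisionError("the three points lie on a straight line")
    return 0.5 * (b23 * f1 + b31 * f2 + b12 * f3) / denom

def quadratic_fit_order(lo=1.0, hi=2.0, iterations=60):
    """Real root of lambda^3 - lambda - 1 = 0 by bisection on [lo, hi]."""
    p = lambda lam: lam**3 - lam - 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if p(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# The fit is exact for a quadratic: f(x) = (x - 0.3)^2 + 1 sampled at 0, 1, 2.
q = lambda x: (x - 0.3) ** 2 + 1.0
x4 = quadratic_fit_step(0.0, q(0.0), 1.0, q(1.0), 2.0, q(2.0))
order = quadratic_fit_order()
```

For the quadratic example the step returns the exact minimizer 0.3, and the bisection gives an order of roughly 1.32, which the text rounds to 1.3.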


Approximate Methods 


In practice line searches are terminated before they have converged to the actual 
minimum point. In one method, for example, a fairly large value for $x_1$ is chosen 
and this value is successively reduced by a positive factor $\beta < 1$ until a sufficient 
decrease in the function value is obtained. Approximate methods and suitable 
stopping criteria are discussed in Section 8.5. 


8.3 GLOBAL CONVERGENCE OF CURVE FITTING 


Above, we analyzed the convergence of various curve fitting procedures in the 
neighborhood of the solution point. If, however, any of these procedures were 
applied in pure form to search a line for a minimum, there is the danger—alas, the 
most likely possibility—that the process would diverge or wander about meaning- 
lessly. In other words, the process may never get close enough to the solution for 
our detailed local convergence analysis to be applicable. It is therefore important to 
artfully combine our knowledge of the local behavior with conditions guaranteeing 
global convergence to yield a workable and effective procedure. 

The key to guaranteeing global convergence is the Global Convergence 
Theorem of Chapter 7. Application of this theorem in turn hinges on the 
construction of a suitable descent function and minor modifications of a pure curve 
fitting algorithm. We offer below a particular blend of this kind of construction 
and analysis, taking as departure point the quadratic fit procedure discussed in 
Section 8.2 above. 

Let us assume that the function f that we wish to minimize is strictly unimodal 
and has continuous second partial derivatives. We initiate our search procedure by 
searching along the line until we find three points $x_1$, $x_2$, $x_3$ with $x_1 < x_2 < x_3$ such 
that $f(x_1) \ge f(x_2) \le f(x_3)$. In other words, the value at the middle of these three 
points is less than that at either end. Such a sequence of points can be determined 
in a number of ways—see Exercise 7. 

The main reason for using points having this pattern is that a quadratic fit to 
these points will have a minimum (rather than a maximum) and the minimum point 
will lie in the interval $[x_1, x_3]$. See Fig. 8.7. We modify the pure quadratic fit 
algorithm so that it always works with points in this basic three-point pattern. 

The point $x_4$ is calculated from the quadratic fit in the standard way and $f(x_4)$ 
is measured. Assuming (as in the figure) that $x_2 < x_4 < x_3$, and accounting for the 
unimodal nature of f, there are but two possibilities: 


1. $f(x_4) \le f(x_2)$ 
2. $f(x_2) < f(x_4) \le f(x_3)$. 


In either case a new three-point pattern, $\bar{x}_1$, $\bar{x}_2$, $\bar{x}_3$, involving $x_4$ and two of the 
old points, can be determined: In case (1) it is 

$$(\bar{x}_1, \bar{x}_2, \bar{x}_3) = (x_2, x_4, x_3),$$




while in case (2) it is 

$$(\bar{x}_1, \bar{x}_2, \bar{x}_3) = (x_1, x_2, x_4).$$

Fig. 8.7 Three-point pattern 


We then use this three-point pattern to fit another quadratic and continue. The pure 
quadratic fit procedure determines the next point from the current point and the 
previous two points. In the modification above, the next point is determined from 
the current point and the two out of three last points that form a three-point pattern 
with it. This simple modification leads to global convergence. 
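
The modified procedure can be sketched as follows. This Python code is an illustration under several assumptions that go beyond the text: the function names, the stopping rule (successive estimates closer than `tol`), and the treatment of the mirror-image situation $x_1 < x_4 < x_2$ are choices made here, and the limiting cases with coincident points are not handled beyond stopping.

```python
import math

def quadratic_fit_min(x1, f1, x2, f2, x3, f3):
    """Minimizer of the quadratic through three points, Eq. (21); None if degenerate."""
    num = (x2**2 - x3**2) * f1 + (x3**2 - x1**2) * f2 + (x1**2 - x2**2) * f3
    den = (x2 - x3) * f1 + (x3 - x1) * f2 + (x1 - x2) * f3
    return None if den == 0.0 else 0.5 * num / den

def three_point_search(f, x1, x2, x3, tol=1e-6, max_iter=100):
    """Quadratic fit that always keeps a pattern x1 < x2 < x3 with f1 >= f2 <= f3."""
    f1, f2, f3 = f(x1), f(x2), f(x3)
    if not (x1 < x2 < x3 and f1 >= f2 <= f3):
        raise ValueError("initial points must form a three-point pattern")
    for _ in range(max_iter):
        x4 = quadratic_fit_min(x1, f1, x2, f2, x3, f3)
        if x4 is None or abs(x4 - x2) < tol:
            break
        f4 = f(x4)
        if x4 > x2:
            if f4 <= f2:   # case (1): new pattern (x2, x4, x3)
                x1, f1, x2, f2 = x2, f2, x4, f4
            else:          # case (2): new pattern (x1, x2, x4)
                x3, f3 = x4, f4
        else:              # mirror-image cases when x1 < x4 < x2
            if f4 <= f2:   # new pattern (x1, x4, x2)
                x3, f3, x2, f2 = x2, f2, x4, f4
            else:          # new pattern (x4, x2, x3)
                x1, f1 = x4, f4
    return x2

# Hypothetical example: f(x) = exp(x) - 2x is unimodal with minimum at ln 2.
x_min = three_point_search(lambda x: math.exp(x) - 2.0 * x, 0.0, 0.5, 2.0)
```

Every update keeps the middle point as the best point found so far, so the sum of the three function values never increases.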

To prove convergence, we note that each three-point pattern can be thought 
of as defining a vector x in $E^3$. Corresponding to an $x = (x_1, x_2, x_3)$ such that 
$(x_1, x_2, x_3)$ form a three-point pattern with respect to f, we define 
$A(x) = (\bar{x}_1, \bar{x}_2, \bar{x}_3)$ as discussed above. For completeness we must consider the case where 
two or more of the $x_i$, i = 1, 2, 3 are equal, since this may occur. The appropriate 
definitions are simply limiting cases of the earlier ones. For example, if $x_1 = x_2$, 
then $(x_1, x_2, x_3)$ form a three-point pattern if $f(x_2) \le f(x_3)$ and $f'(x_2) < 0$ (which 
is the limiting case of $f(x_2) < f(x_1)$). A quadratic is fit in this case by using the 
values at the two distinct points and the derivative at the duplicated point. In case 
$x_1 = x_2 = x_3$, $(x_1, x_2, x_3)$ forms a three-point pattern if $f'(x_2) = 0$ and $f''(x_2) \ge 0$. 
With these definitions, the map A is well defined. It is also continuous, since curve 
fitting depends continuously on the data. 

We next define the solution set $\Gamma \subset E^3$ as the points $\mathbf{x}^* = (x^*, x^*, x^*)$ where 
$f'(x^*) = 0$. 

Finally, we let $Z(x) = f(x_1) + f(x_2) + f(x_3)$. It is easy to see that Z is a descent 
function for A. After application of A one of the values $f(x_1)$, $f(x_2)$, $f(x_3)$ will 
be replaced by $f(x_4)$, and by construction, and the assumption that f is unimodal, 
it will replace a strictly larger value. Of course, at $\mathbf{x}^* = (x^*, x^*, x^*)$ we have 
$A(\mathbf{x}^*) = \mathbf{x}^*$ and hence $Z(A(\mathbf{x}^*)) = Z(\mathbf{x}^*)$. 

Since all points are contained in the initial interval, we have all the requirements 
for the Global Convergence Theorem. Thus the process converges to the solution. 

The order of convergence may not be destroyed by this modification, if near the 
solution the three-point pattern is always formed from the previous three points. In 
this case we would still have convergence of order 1.3. This cannot be guaranteed, 
however. 

It has often been implicitly suggested, and accepted, that when using the 
quadratic fit technique one should require 


$$f(x_{k+1}) < f(x_k)$$


so as to guarantee convergence. If the inequality is not satisfied at some cycle, then a 
special local search is used to find a better $x_{k+1}$ that does satisfy it. This philosophy 
amounts to taking $Z(x) = f(x_3)$ in our general framework and, unfortunately, this 
is not a descent function even for unimodal functions, and hence the special local 
search is likely to be necessary several times. It is true, of course, that a similar 
special local search may, occasionally, be required for the technique we suggest in 
regions of multiple minima, but it is never required in a unimodal region. 

The above construction, based on the pure quadratic fit technique, can be 
emulated to produce effective procedures based on other curve fitting techniques. 
For application to smooth functions these techniques seem to be the best available in 
terms of flexibility to accommodate as much derivative information as is available, 
fast convergence, and a guarantee of global convergence. 


8.4 CLOSEDNESS OF LINE SEARCH 
ALGORITHMS 


Since searching along a line for a minimum point is a component part of most 
nonlinear programming algorithms, it is desirable to establish at once that this 
procedure is closed; that is, that the end product of the iterative procedures outlined 
above, when viewed as a single algorithmic step finding a minimum along a line, 
define closed algorithms. That is the objective of this section. 

To initiate a line search with respect to a function f, two vectors must be 
specified: the initial point x and the direction d in which the search is to be made. 
The result of the search is a new point. Thus we define the search algorithm S as a 
mapping from $E^{2n}$ to $E^n$. 

We assume that the search is to be made over the semi-infinite line emanating 
from x in the direction d. We also assume, for simplicity, that the search is not 
made in vain; that is, we assume that there is a minimum point along the line. This 
will be the case, for instance, if f is continuous and increases without bound as x 
tends toward infinity. 


Definition. The mapping $S : E^{2n} \to E^n$ is defined by 

$$S(x, d) = \{y : y = x + \alpha d \text{ for some } \alpha \ge 0, \ f(y) = \min_{0 \le \alpha < \infty} f(x + \alpha d)\}. \tag{23}$$


In some cases there may be many vectors y yielding the minimum, so S is a 
set-valued mapping. We must verify that S is closed. 

Theorem. Let f be continuous on $E^n$. Then the mapping defined by (23) is 
closed at (x, d) if $d \ne 0$. 


Proof. Suppose $\{x_k\}$ and $\{d_k\}$ are sequences with $x_k \to x$, $d_k \to d \ne 0$. Suppose 
also that $y_k \in S(x_k, d_k)$ and that $y_k \to y$. We must show that $y \in S(x, d)$. 

For each k we have $y_k = x_k + \alpha_k d_k$ for some $\alpha_k$. From this we may write 

$$\alpha_k = \frac{|y_k - x_k|}{|d_k|}.$$

Taking the limit of the right-hand side of the above, we see that 

$$\alpha_k \to \bar{\alpha} \equiv \frac{|y - x|}{|d|}.$$

It then follows that $y = x + \bar{\alpha} d$. It still remains to be shown that $y \in S(x, d)$. 
For each k and each $\alpha$, $0 \le \alpha < \infty$, 

$$f(y_k) \le f(x_k + \alpha d_k).$$

Letting $k \to \infty$ we obtain 

$$f(y) \le f(x + \alpha d).$$

Thus 

$$f(y) \le \min_{0 \le \alpha < \infty} f(x + \alpha d),$$

and hence $y \in S(x, d)$. ∎ 


The requirement that $d \ne 0$ is natural both theoretically and practically. From 
a practical point of view this condition implies that, when constructing algorithms, 
the choice d = 0 had better occur only in the solution set; but it is clear that if 
d = 0, no search will be made. Theoretically, the map S can fail to be closed at 
d = 0, as illustrated below. 


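Example. On $E^1$ define $f(x) = (x - 1)^2$. Then S(x, d) is not closed at x = 0, d = 0. 
To see this we note that for any d > 0 

$$\min_{0 \le \alpha < \infty} f(\alpha d) = f(1),$$

and hence 

$$S(0, d) = 1;$$

but 

$$\min_{0 \le \alpha < \infty} f(\alpha \cdot 0) = f(0),$$

so that 

$$S(0, 0) = 0.$$

Thus as $d \to 0$, $S(0, d) \not\to S(0, 0)$. 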


8.5 INACCURATE LINE SEARCH 


In practice, of course, it is impossible to obtain the exact minimum point called for 
by the ideal line search algorithm S described above. As a matter of fact, it is often 
desirable to sacrifice accuracy in the line search routine in order to conserve overall 
computation time. Because of these factors we must, to be realistic, be certain, at 
every stage of development, that our theory does not crumble if inaccurate line 
searches are introduced. 

Inaccuracy generally is introduced in a line search algorithm by simply termi- 
nating the search procedure before it has converged. The exact nature of the 
inaccuracy introduced may therefore depend on the particular search technique 
employed and the criterion used for terminating the search. We cannot develop a 
theory that simultaneously covers every important version of inaccuracy without 
seriously detracting from the underlying simplicity of the algorithms discussed 
later. For this reason our general approach, which is admittedly more free-wheeling 
in spirit than necessary but thereby more transparent and less encumbered than a 
detailed account of inaccuracy, will be to analyze algorithms as if an accurate line 
search were made at every step, and then point out in side remarks and exercises 
the effect of inaccuracy. 

In the remainder of this section we present some commonly used criteria for 
terminating a line search. 


Percentage Test 


One important inaccurate line search algorithm is the one that determines the 
search parameter $\alpha$ to within a fixed percentage of its true value. Specifically, a 
constant c, 0 < c < 1 is selected (c = 0.10 is reasonable) and the parameter $\alpha$ 
in the line search is found so as to satisfy $|\alpha - \bar{\alpha}| \le c\bar{\alpha}$ where $\bar{\alpha}$ is the true 
minimizing value of the parameter. This criterion is easy to use in conjunction 
with the standard iterative search techniques described in the first sections of this 
chapter. For example, in the case of the quadratic fit technique using three-point 
patterns applied to a unimodal function, at each stage it is known that the true 
minimum point lies in the interval spanned by the three-point pattern, and hence 
a bound on the maximum possible fractional error at that stage is easily deduced. 
One iterates until this bound is no greater than c. It can be shown (see Exercise 13) 
that this algorithm is closed. 


Armijo’s Rule 


A practical and popular criterion for terminating a line search is Armijo’s rule. The 
essential idea is that the rule should first guarantee that the selected $\alpha$ is not too 
large, and next it should not be too small. Let us define the function 

$$\phi(\alpha) = f(x_k + \alpha d_k).$$

Armijo’s rule is implemented by consideration of the function $\phi(0) + \varepsilon\phi'(0)\alpha$ for 
fixed $\varepsilon$, $0 < \varepsilon < 1$. This function is shown in Fig. 8.8(a) as the dashed line. A value 
of $\alpha$ is considered to be not too large if the corresponding function value lies below 
the dashed line; that is, if 

$$\phi(\alpha) \le \phi(0) + \varepsilon\phi'(0)\alpha. \tag{24}$$

Fig. 8.8 Stopping rules 

To ensure that $\alpha$ is not too small, a value $\eta > 1$ is selected, and $\alpha$ is then considered 
to be not too small if 

$$\phi(\eta\alpha) > \phi(0) + \varepsilon\phi'(0)\eta\alpha.$$

This means that if $\alpha$ is increased by the factor $\eta$, it will fail to meet the test (24). 
The acceptable region defined by the Armijo rule is shown in Fig. 8.8(a) when 
$\eta = 2$. 

Sometimes in practice, the Armijo test is used to define a simplified line search 
technique that does not employ curve fitting methods. One begins with an arbitrary 
$\alpha$. If it satisfies (24), it is repeatedly increased by $\eta$ ($\eta = 2$ or $\eta = 10$ and $\varepsilon = .2$ 
are often used) until (24) is not satisfied, and then the penultimate $\alpha$ is selected. If, 
on the other hand, the original $\alpha$ does not satisfy (24), it is repeatedly divided by 
$\eta$ until the resulting $\alpha$ does satisfy (24). 
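
The simplified Armijo search just described might look as follows in Python. The function name, the argument names, and the iteration cap `max_iter` are assumptions introduced for illustration; `phi` is the one-dimensional function $\phi(\alpha)$ and `dphi0` is the slope $\phi'(0)$, which is assumed to be negative.

```python
def armijo_line_search(phi, dphi0, alpha=1.0, eps=0.2, eta=2.0, max_iter=100):
    """Return a step that satisfies test (24) while eta times that step does not."""
    phi0 = phi(0.0)

    def not_too_large(a):
        # Test (24): the function value lies on or below the dashed line.
        return phi(a) <= phi0 + eps * dphi0 * a

    if not_too_large(alpha):
        # Increase by eta until (24) fails, keeping the penultimate value.
        for _ in range(max_iter):
            if not not_too_large(eta * alpha):
                break
            alpha *= eta
    else:
        # Divide by eta until (24) is satisfied.
        for _ in range(max_iter):
            alpha /= eta
            if not_too_large(alpha):
                break
    return alpha

# Hypothetical example: phi(alpha) = (alpha - 1)^2, so phi(0) = 1 and phi'(0) = -2.
phi = lambda a: (a - 1.0) ** 2
step_from_large = armijo_line_search(phi, -2.0, alpha=8.0)
step_from_small = armijo_line_search(phi, -2.0, alpha=0.3)
```

For this example test (24) holds exactly when $\alpha \le 1.6$, so starting from 8 the search halves down to 1, and starting from 0.3 it doubles up to 1.2.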


Goldstein Test 


Another line search accuracy test that is frequently used is the Goldstein test. As in 
the Armijo rule, a value of $\alpha$ is considered not too large if it satisfies (24), with a 
given $\varepsilon$, $0 < \varepsilon < 1/2$. A value of $\alpha$ is considered not too small in the Goldstein 
test if 

$$\phi(\alpha) > \phi(0) + (1 - \varepsilon)\phi'(0)\alpha. \tag{25}$$

In other words $\phi(\alpha)$ must lie above the lower dashed line shown in Fig. 8.8(b). 
In terms of the original notation, the Goldstein criterion for an acceptable value 
of $\alpha$, with corresponding $x_{k+1} = x_k + \alpha d_k$, is 

$$\varepsilon \le \frac{f(x_{k+1}) - f(x_k)}{\alpha \nabla f(x_k) d_k} \le 1 - \varepsilon.$$

We now show that the Goldstein test leads to a closed line search algorithm. 


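Theorem. Let $f \in C^2$ on $E^n$. Fix $\varepsilon$, $0 < \varepsilon < 1/2$. Then the mapping $S : E^{2n} \to E^n$ 
defined by 

$$S(x, d) = \left\{y : y = x + \alpha d \text{ for some } \alpha \ge 0, \ \varepsilon \le \frac{f(y) - f(x)}{\alpha \nabla f(x) d} \le 1 - \varepsilon\right\}$$

is closed at (x, d) if $d \ne 0$. 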


Proof. Suppose $\{x_k\}$ and $\{d_k\}$ are sequences with $x_k \to x$, $d_k \to d \ne 0$. Suppose 
also that $y_k \in S(x_k, d_k)$ and $y_k \to y$. We must show $y \in S(x, d)$. For each k, 
$y_k = x_k + \alpha_k d_k$ for some $\alpha_k$. Thus 

$$\alpha_k = \frac{|y_k - x_k|}{|d_k|} \to \frac{|y - x|}{|d|} \equiv \bar{\alpha}.$$

Hence $\alpha_k$ converges to some $\bar{\alpha}$ and $y = x + \bar{\alpha} d$. Let 

$$\phi(x, d, \alpha) = \frac{f(x + \alpha d) - f(x)}{\alpha \nabla f(x) d}.$$

Then $\varepsilon \le \phi(x_k, d_k, \alpha_k) \le 1 - \varepsilon$ for all k. By our assumptions on f(x), $\phi$ is 
continuous. Thus $\phi(x_k, d_k, \alpha_k) \to \phi(x, d, \bar{\alpha})$ and $\varepsilon \le \phi(x, d, \bar{\alpha}) \le 1 - \varepsilon$, which 
implies $y \in S(x, d)$. ∎ 


Wolfe Test 


If derivatives of the objective function, as well as its values, can be evaluated
relatively easily, then the Wolfe test, which is a variation of the above, is sometimes
preferred. In this case $\varepsilon$ is selected with $0 < \varepsilon < 1/2$, and $\alpha$ is required to
satisfy (24) and


$$\phi'(\alpha) \geq (1 - \varepsilon)\phi'(0).$$


This test is illustrated in Fig. 8.8(c). An advantage of this test is that this last
criterion is invariant to scale-factor changes, whereas (25) in the Goldstein test
is not.
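
To compare the three acceptance tests side by side, they can be written as simple predicates. The sketch below is illustrative only; the function names and the sample function $\phi(\alpha) = (\alpha - 1)^2$ are assumptions introduced here.

```python
def armijo_ok(phi, dphi, a, eps=0.2, eta=2.0):
    """Armijo rule: (24) holds at a and fails at eta * a."""
    upper = lambda s: phi(0.0) + eps * dphi(0.0) * s
    return phi(a) <= upper(a) and phi(eta * a) > upper(eta * a)


def goldstein_ok(phi, dphi, a, eps=0.2):
    """Goldstein test: (24) and (25), with 0 < eps < 1/2."""
    return (phi(a) <= phi(0.0) + eps * dphi(0.0) * a
            and phi(a) > phi(0.0) + (1.0 - eps) * dphi(0.0) * a)


def wolfe_ok(phi, dphi, a, eps=0.2):
    """Wolfe test: (24) plus the slope condition phi'(a) >= (1 - eps) phi'(0)."""
    return (phi(a) <= phi(0.0) + eps * dphi(0.0) * a
            and dphi(a) >= (1.0 - eps) * dphi(0.0))


# Sample line function and its derivative (assumed for illustration).
phi = lambda a: (a - 1.0) ** 2
dphi = lambda a: 2.0 * (a - 1.0)

verdicts = {a: (armijo_ok(phi, dphi, a),
                goldstein_ok(phi, dphi, a),
                wolfe_ok(phi, dphi, a))
            for a in (0.3, 0.5, 1.0, 2.0)}
```

For this $\phi$ with $\varepsilon = 0.2$ and $\eta = 2$, the acceptable intervals work out to $0.8 < \alpha \leq 1.6$ (Armijo), $0.4 < \alpha \leq 1.6$ (Goldstein), and $0.2 \leq \alpha \leq 1.6$ (Wolfe), so the tests differ only in how they exclude steps that are too small.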


Backtracking 


A simplified method of line search is available when a good estimate of a suitable
step length is available. This is the case for the multi-dimensional Newton’s method
for minimization discussed in the next chapter. Here a good initial choice is $\alpha = 1$.
Backtracking is defined by the initial guess $\alpha$ and two positive parameters $\eta > 1$
and $\varepsilon < 1$ (usually $\varepsilon < .5$). The stopping criterion used is the same as the first part
of Armijo’s rule or the Goldstein test. That is, defining $\phi(\alpha) \equiv f(x_k + \alpha d_k)$, the
procedure is terminated at the current $\alpha$ if $\phi(\alpha) \leq \phi(0) + \varepsilon \phi'(0)\alpha$. If this criterion
is not satisfied, then $\alpha$ is reduced by the factor $1/\eta$. That is, $\alpha_{\text{new}} = \alpha_{\text{old}}/\eta$. Often
$\eta$ of about 1.1 or 1.2 is used.

If the initial $\alpha$ (such as $\alpha = 1$) satisfies the test, then it is taken as the step size.
Otherwise, $\alpha$ is reduced by the factor $1/\eta$. Repeating this successively, the first $\alpha$ that satisfies
the test is declared the final value. By definition it is known that the previous value
$\alpha_{\text{old}} = \alpha_{\text{new}}\eta$ does not pass the first test, and this means that it passes the second
condition of Armijo’s rule.
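
A minimal sketch of backtracking follows, assuming NumPy arrays for points and directions. The function name `backtracking`, the iteration cap, and the quadratic test function are hypothetical and serve only to illustrate the rule.

```python
import numpy as np


def backtracking(f, grad, x, d, alpha=1.0, eps=0.25, eta=1.2, max_steps=200):
    """Return the first alpha in alpha0, alpha0/eta, ... that satisfies (24)."""
    fx = f(x)
    slope = float(grad(x) @ d)      # phi'(0); negative for a descent direction
    for _ in range(max_steps):
        if f(x + alpha * d) <= fx + eps * slope * alpha:
            return alpha
        alpha /= eta                # alpha_new = alpha_old / eta
    raise RuntimeError("no acceptable step found")


# Usage: one steepest descent step on f(x) = x1^2 + 10 x2^2 from (1, 1).
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
x0 = np.array([1.0, 1.0])
d0 = -grad(x0)
alpha0 = backtracking(f, grad, x0, d0)
```

With $\eta$ close to 1 the accepted step is close to the largest step that passes (24), at the cost of more function evaluations; a larger $\eta$ trades accuracy of the step for fewer evaluations.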




8.6 THE METHOD OF STEEPEST DESCENT 


One of the oldest and most widely known methods for minimizing a function of 
several variables is the method of steepest descent (often referred to as the gradient 
method). The method is extremely important from a theoretical viewpoint, since
it is one of the simplest for which a satisfactory analysis exists. More advanced
algorithms are often motivated by an attempt to modify the basic steepest descent 
technique in such a way that the new algorithm will have superior convergence 
properties. The method of steepest descent remains, therefore, not only the technique 
most often first tried on a new problem but also the standard of reference against 
which other techniques are measured. The principles used for its analysis will be 
used throughout this book. 


The Method 


Let $f$ have continuous first partial derivatives on $E^n$. We will frequently have need
for the gradient vector of $f$ and therefore we introduce some simplifying notation.
The gradient $\nabla f(x)$ is, according to our conventions, defined as an $n$-dimensional row
vector. For convenience we define the $n$-dimensional column vector $g(x) = \nabla f(x)^T$.
When there is no chance for ambiguity, we sometimes suppress the argument $x$
and, for example, write $g_k$ for $g(x_k) = \nabla f(x_k)^T$.

The method of steepest descent is defined by the iterative algorithm


$$x_{k+1} = x_k - \alpha_k g_k,$$


where $\alpha_k$ is a nonnegative scalar minimizing $f(x_k - \alpha g_k)$. In words, from the point
$x_k$ we search along the direction of the negative gradient $-g_k$ to a minimum point
on this line; this minimum point is taken to be $x_{k+1}$.

In formal terms, the overall algorithm $A : E^n \to E^n$ which gives $x_{k+1} \in A(x_k)$
can be decomposed in the form $A = SG$. Here $G : E^n \to E^{2n}$ is defined by $G(x) =
(x, -g(x))$, giving the initial point and direction of a line search. This is followed
by the line search $S : E^{2n} \to E^n$ defined in Section 8.4.


Global Convergence 


It was shown in Section 8.4 that $S$ is closed if $\nabla f(x) \neq 0$, and it is clear that $G$ is
continuous. Therefore, by Corollary 2 in Section 7.7, $A$ is closed.

We define the solution set to be the points $x$ where $\nabla f(x) = 0$. Then $Z(x) = f(x)$
is a descent function for $A$, since for $\nabla f(x) \neq 0$


$$\min_{0 \leq \alpha < \infty} f(x - \alpha g(x)) < f(x).$$


Thus by the Global Convergence Theorem, if the sequence $\{x_k\}$ is bounded, it will
have limit points and each of these is a solution.


The Quadratic Case


Essentially all of the important local convergence characteristics of the method of
steepest descent are revealed by an investigation of the method when applied to
quadratic problems. Consider


$$f(x) = \tfrac{1}{2} x^T Q x - x^T b, \tag{26}$$


where $Q$ is a positive definite symmetric $n \times n$ matrix. Since $Q$ is positive definite,
all of its eigenvalues are positive. We assume that these eigenvalues are ordered: $0 <
a = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n = A$. With $Q$ positive definite, it follows (from Proposition 5,
Section 7.4) that $f$ is strictly convex.

The unique minimum point of $f$ can be found directly, by setting the gradient
to zero, as the vector $x^*$ satisfying


$$Q x^* = b. \tag{27}$$


Moreover, introducing the function


$$E(x) = \tfrac{1}{2} (x - x^*)^T Q (x - x^*), \tag{28}$$


we have $E(x) = f(x) + (1/2) x^{*T} Q x^*$, which shows that the function $E$ differs from
$f$ only by a constant. For many purposes then, it will be convenient to consider
that we are minimizing $E$ rather than $f$.

The gradient (of both $f$ and $E$) is given explicitly by


$$g(x) = Qx - b. \tag{29}$$


Thus the method of steepest descent can be expressed as


$$x_{k+1} = x_k - \alpha_k g_k, \tag{30}$$


where $g_k = Qx_k - b$ and where $\alpha_k$ minimizes $f(x_k - \alpha g_k)$. We can, however, in this
special case, determine the value of $\alpha_k$ explicitly. We have, by definition (26),


$$f(x_k - \alpha g_k) = \tfrac{1}{2} (x_k - \alpha g_k)^T Q (x_k - \alpha g_k) - (x_k - \alpha g_k)^T b,$$


which (as can be found by differentiating with respect to $\alpha$) is minimized at


$$\alpha_k = \frac{g_k^T g_k}{g_k^T Q g_k}. \tag{31}$$


Hence the method of steepest descent (30) takes the explicit form


$$x_{k+1} = x_k - \left( \frac{g_k^T g_k}{g_k^T Q g_k} \right) g_k, \tag{32}$$


where $g_k = Qx_k - b$.


Fig. 8.9 Steepest descent
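
The differentiation that produces the step length (31) takes one line, written out here for reference using only the definitions (26) and (29):

```latex
\frac{d}{d\alpha} f(x_k - \alpha g_k)
  = -g_k^T Q (x_k - \alpha g_k) + g_k^T b
  = -g_k^T (Q x_k - b) + \alpha\, g_k^T Q g_k
  = -g_k^T g_k + \alpha\, g_k^T Q g_k .
```

Setting this derivative to zero gives $\alpha_k = g_k^T g_k / g_k^T Q g_k$. The second derivative with respect to $\alpha$ is $g_k^T Q g_k$, which is positive whenever $g_k \neq 0$ because $Q$ is positive definite, so the stationary point is indeed the minimum along the line.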


The function $f$ and the steepest descent process can be illustrated as in Fig. 8.9
by showing contours of constant values of $f$ and a typical sequence developed
by the process. The contours of $f$ are $n$-dimensional ellipsoids with axes in the
directions of the $n$ mutually orthogonal eigenvectors of $Q$. The axis corresponding
to the $i$th eigenvector has length proportional to $1/\sqrt{\lambda_i}$. We now analyze this process
and show that the rate of convergence depends on the ratio of the lengths of the
axes of the elliptical contours of $f$, that is, on the eccentricity of the ellipsoids.


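Lemma 1. The iterative process (32) satisfies


$$E(x_{k+1}) = \left\{ 1 - \frac{(g_k^T g_k)^2}{(g_k^T Q g_k)(g_k^T Q^{-1} g_k)} \right\} E(x_k). \tag{33}$$


Proof. The proof is by direct computation. We have, setting $y_k = x_k - x^*$,


$$\frac{E(x_k) - E(x_{k+1})}{E(x_k)} = \frac{2\alpha_k g_k^T Q y_k - \alpha_k^2 g_k^T Q g_k}{y_k^T Q y_k}.$$


Using $g_k = Q y_k$ we have


$$\frac{E(x_k) - E(x_{k+1})}{E(x_k)} = \frac{\dfrac{2 (g_k^T g_k)^2}{g_k^T Q g_k} - \dfrac{(g_k^T g_k)^2}{g_k^T Q g_k}}{g_k^T Q^{-1} g_k} = \frac{(g_k^T g_k)^2}{(g_k^T Q g_k)(g_k^T Q^{-1} g_k)}. \ ∎$$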


In order to obtain a bound on the rate of convergence, we need a bound on
the right-hand side of (33). The best bound is due to Kantorovich and his lemma,
stated below, is a useful general tool in convergence analysis.

Kantorovich inequality: Let $Q$ be a positive definite symmetric $n \times n$ matrix.
For any vector $x$ there holds


$$\frac{(x^T x)^2}{(x^T Q x)(x^T Q^{-1} x)} \geq \frac{4aA}{(a + A)^2}, \tag{34}$$


where $a$ and $A$ are, respectively, the smallest and largest eigenvalues of $Q$.

Proof. Let the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ of $Q$ satisfy


$$0 < a = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n = A.$$


By an appropriate change of coordinates the matrix $Q$ becomes diagonal with
diagonal $(\lambda_1, \lambda_2, \ldots, \lambda_n)$. In this coordinate system we have


$$\frac{(x^T x)^2}{(x^T Q x)(x^T Q^{-1} x)} = \frac{\left( \sum_{i=1}^n x_i^2 \right)^2}{\left( \sum_{i=1}^n \lambda_i x_i^2 \right) \left( \sum_{i=1}^n x_i^2 / \lambda_i \right)},$$


which can be written as


$$\frac{(x^T x)^2}{(x^T Q x)(x^T Q^{-1} x)} = \frac{1 / \sum_{i=1}^n \xi_i \lambda_i}{\sum_{i=1}^n \xi_i / \lambda_i} \equiv \frac{\phi(\xi)}{\psi(\xi)},$$


where $\xi_i = x_i^2 / \sum_{j=1}^n x_j^2$. We have converted the expression to the ratio of two
functions involving convex combinations; one a combination of $\lambda_i$’s; the other a
combination of $1/\lambda_i$’s. The situation is shown pictorially in Fig. 8.10. The curve
in the figure represents the function $1/\lambda$. Since $\sum_{i=1}^n \xi_i \lambda_i$ is a point between $\lambda_1$
and $\lambda_n$, the value of $\phi(\xi)$ is a point on the curve. On the other hand, the value of
$\psi(\xi)$ is a convex combination of points on the curve and its value corresponds to
a point in the shaded region. For the same vector $\xi$ both functions are represented
by points on the same vertical line. The minimum value of this ratio is achieved
for some $\lambda = \xi_1 \lambda_1 + \xi_n \lambda_n$, with $\xi_1 + \xi_n = 1$. Using the relation $\xi_1/\lambda_1 + \xi_n/\lambda_n =
(\lambda_1 + \lambda_n - \xi_1 \lambda_1 - \xi_n \lambda_n)/\lambda_1 \lambda_n$, an appropriate bound is


$$\frac{\phi(\xi)}{\psi(\xi)} \geq \min_{\lambda_1 \leq \lambda \leq \lambda_n} \frac{1/\lambda}{(\lambda_1 + \lambda_n - \lambda)/(\lambda_1 \lambda_n)}.$$


The minimum is achieved at $\lambda = (\lambda_1 + \lambda_n)/2$, yielding


$$\frac{\phi(\xi)}{\psi(\xi)} \geq \frac{4 \lambda_1 \lambda_n}{(\lambda_1 + \lambda_n)^2}. \ ∎$$
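
The inequality (34) is easy to probe numerically. The following sketch samples random vectors for an assumed diagonal test matrix (the eigenvalues 1, 2, 5, 10 are an arbitrary illustrative choice) and also evaluates the ratio at a vector that weights the two extreme eigenvectors equally, where the bound is attained.

```python
import numpy as np


def kantorovich_ratio(Q, x):
    """Left-hand side of (34) for a symmetric positive definite Q."""
    return (x @ x) ** 2 / ((x @ Q @ x) * (x @ np.linalg.solve(Q, x)))


# Assumed test matrix: diagonal, so the eigenvalues can be read off directly.
lam = np.array([1.0, 2.0, 5.0, 10.0])
Q = np.diag(lam)
a, A = lam.min(), lam.max()
bound = 4.0 * a * A / (a + A) ** 2

rng = np.random.default_rng(0)
ratios = [kantorovich_ratio(Q, rng.standard_normal(4)) for _ in range(1000)]
worst_sample = min(ratios)

# Equal weight on the two extreme eigenvectors attains the bound.
x_extreme = np.array([1.0, 0.0, 0.0, 1.0])
attained = kantorovich_ratio(Q, x_extreme)
```

Random sampling rarely comes close to the bound; the worst case requires the vector to lie in the span of the eigenvectors belonging to $a$ and $A$ with equal weights, which matches the choice $\xi_1 = \xi_n = 1/2$ in the proof.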


Combining the above two lemmas, we obtain the central result on the convergence
of the method of steepest descent.


Fig. 8.10 Kantorovich inequality


Theorem. (Steepest descent, quadratic case). For any $x_0 \in E^n$ the method
of steepest descent (32) converges to the unique minimum point $x^*$ of $f$.
Furthermore, with $E(x) = \tfrac{1}{2} (x - x^*)^T Q (x - x^*)$, there holds at every step $k$


$$E(x_{k+1}) \leq \left( \frac{A - a}{A + a} \right)^2 E(x_k). \tag{35}$$


Proof. By Lemma 1 and the Kantorovich inequality


$$E(x_{k+1}) \leq \left\{ 1 - \frac{4aA}{(A + a)^2} \right\} E(x_k) = \left( \frac{A - a}{A + a} \right)^2 E(x_k).$$


It follows immediately that $E(x_k) \to 0$ and hence, since $Q$ is positive definite, that
$x_k \to x^*$. ∎


Roughly speaking, the above theorem says that the convergence rate of steepest
descent is slowed as the contours of $f$ become more eccentric. If $a = A$, corresponding
to circular contours, convergence occurs in a single step. Note, however,
that even if $n - 1$ of the $n$ eigenvalues are equal and the remaining one is a
great distance from these, convergence will be slow, and hence a single abnormal
eigenvalue can destroy the effectiveness of steepest descent.

In the terminology introduced in Section 7.8, the above theorem states that
with respect to the error function $E$ (or equivalently $f$) the method of steepest
descent converges linearly with a ratio no greater than $[(A - a)/(A + a)]^2$. The
actual rate depends on the initial point $x_0$. However, for some initial points the
bound is actually achieved. Furthermore, it has been shown by Akaike that, if the
ratio is unfavorable, the process is very likely to converge at a rate close to the
bound. Thus, somewhat loosely but with reasonable justification, we say that the
convergence ratio of steepest descent is $[(A - a)/(A + a)]^2$.

It should be noted that the convergence rate actually depends only on the ratio
$r = A/a$ of the largest to the smallest eigenvalue. Thus the convergence ratio is


$$\left( \frac{A - a}{A + a} \right)^2 = \left( \frac{r - 1}{r + 1} \right)^2,$$


which clearly shows that convergence is slowed as $r$ increases. The ratio $r$, which
is the single number associated with the matrix $Q$ that characterizes convergence,
is often called the condition number of the matrix.


Example. Let us take


$$Q = \begin{bmatrix} 0.78 & -0.02 & -0.12 & -0.14 \\ -0.02 & 0.86 & -0.04 & 0.06 \\ -0.12 & -0.04 & 0.72 & -0.08 \\ -0.14 & 0.06 & -0.08 & 0.74 \end{bmatrix}, \qquad b = (0.76, 0.08, 1.12, 0.68).$$


For this matrix it can be calculated that $a = 0.52$, $A = 0.94$ and hence $r = 1.8$.
This is a very favorable condition number and leads to the convergence ratio
$[(A - a)/(A + a)]^2 = 0.081$. Thus each iteration will reduce the error in the objective
by more than a factor of ten; or, equivalently, each iteration will add about one
more digit of accuracy. Indeed, starting from the origin the sequence of values
obtained by steepest descent as shown in Table 8.1 is consistent with this estimate.
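
The example can be reproduced with a short script that applies iteration (32) directly. This is a sketch under the assumption that NumPy is available; the variable names are illustrative, and the only data used are the $Q$ and $b$ given above.

```python
import numpy as np

Q = np.array([[ 0.78, -0.02, -0.12, -0.14],
              [-0.02,  0.86, -0.04,  0.06],
              [-0.12, -0.04,  0.72, -0.08],
              [-0.14,  0.06, -0.08,  0.74]])
b = np.array([0.76, 0.08, 1.12, 0.68])

f = lambda x: 0.5 * x @ Q @ x - x @ b

eigs = np.linalg.eigvalsh(Q)          # eigenvalues in ascending order
a, A = eigs[0], eigs[-1]
ratio_bound = ((A - a) / (A + a)) ** 2

x_star = np.linalg.solve(Q, b)        # minimum point, from (27)
E = lambda x: 0.5 * (x - x_star) @ Q @ (x - x_star)

x = np.zeros(4)                       # start from the origin
f_values, E_values = [f(x)], [E(x)]
for k in range(6):
    g = Q @ x - b                     # gradient (29)
    alpha = (g @ g) / (g @ Q @ g)     # exact step length (31)
    x = x - alpha * g                 # iteration (32)
    f_values.append(f(x))
    E_values.append(E(x))

for k, fk in enumerate(f_values):
    print(f"{k}  {fk:.7f}")
```

Each computed error $E(x_{k+1})$ can be compared against `ratio_bound * E(x_k)` to confirm that the bound (35) holds at every step of this run.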


Table 8.1 Solution to Example

Step $k$      $f(x_k)$
   0           0
   1          -2.1563625
   2          -2.1744062
   3          -2.1746440
   4          -2.1746585
   5          -2.1746595
   6          -2.1746595

Solution point $x^*$ = (1.534965, 0.1220097, 1.975156, 1.412954)


The Nonquadratic Case


For nonquadratic functions, we expect that steepest descent will also do reasonably
well if the condition number is modest. Fortunately, we are able to establish
estimates of the progress of the method when the Hessian matrix is always positive
definite. Specifically, we assume that the Hessian matrix is bounded above and
below as $aI \leq F(x) \leq AI$. (Thus $f$ is strongly convex.) We present three analyses:

1. Exact line search. Given a point $x_k$, we have for any $\alpha$


$$f(x_k - \alpha g(x_k)) \leq f(x_k) - \alpha g(x_k)^T g(x_k) + \frac{A \alpha^2}{2} g(x_k)^T g(x_k). \tag{36}$$


Minimizing both sides separately with respect to $\alpha$, the inequality will hold for the
two minima. The minimum of the left-hand side is $f(x_{k+1})$. The minimum of the
right-hand side occurs at $\alpha = 1/A$, yielding the result


$$f(x_{k+1}) \leq f(x_k) - \frac{1}{2A} |g(x_k)|^2,$$


where $|g(x_k)|^2 \equiv g(x_k)^T g(x_k)$. Subtracting the optimal value $f^* = f(x^*)$ from both
sides produces


$$f(x_{k+1}) - f^* \leq f(x_k) - f^* - \frac{1}{2A} |g(x_k)|^2. \tag{37}$$


In a similar way, for any $x$ there holds


$$f(x) \geq f(x_k) + g(x_k)^T (x - x_k) + \frac{a}{2} |x - x_k|^2.$$


Again we can minimize both sides separately. The minimum of the left-hand side is
$f^*$, the optimal solution value. Minimizing the right-hand side leads to a quadratic
optimization problem. The solution is $x = x_k - g(x_k)/a$. Substituting this $x$ in the
right-hand side of the inequality gives


$$f^* \geq f(x_k) - \frac{1}{2a} |g(x_k)|^2. \tag{38}$$


From (38) we have


$$-|g(x_k)|^2 \leq 2a [f^* - f(x_k)]. \tag{39}$$


Substituting this in (37) gives


$$f(x_{k+1}) - f^* \leq (1 - a/A) [f(x_k) - f^*]. \tag{40}$$


This shows that the method of steepest descent makes progress even when it is not
close to the solution.
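
The substitution that leads from (37) and (39) to (40) can be written out in one chain; the display below uses only those two inequalities:

```latex
f(x_{k+1}) - f^*
  \;\le\; f(x_k) - f^* - \frac{1}{2A}\,|g(x_k)|^2
  \;\le\; f(x_k) - f^* + \frac{2a}{2A}\,\bigl[f^* - f(x_k)\bigr]
  \;=\; \Bigl(1 - \frac{a}{A}\Bigr)\bigl[f(x_k) - f^*\bigr].
```

In terms of $r = A/a$ the factor is $1 - 1/r$. This is never smaller than the quadratic-case ratio $[(r-1)/(r+1)]^2$, since $(r-1)/(r+1)^2 \leq 1/r$ for every $r \geq 1$; the estimate (40) is therefore a more conservative bound, obtained under weaker assumptions.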


2. Other stopping criteria. As an example of how other stopping criteria can
be treated, we examine the rate of convergence when using Armijo’s rule with $\varepsilon < .5$
and $\eta > 1$. Note first that the inequality $t \geq t^2$ for $0 \leq t \leq 1$ implies by a change of
variable that


$$-\alpha + \frac{\alpha^2 A}{2} \leq -\alpha/2$$


for $0 \leq \alpha \leq 1/A$. Then using (36) we have that for $\alpha < 1/A$


$$\begin{aligned}
f(x_k - \alpha g(x_k)) &\leq f(x_k) - \alpha |g(x_k)|^2 + .5\alpha^2 A |g(x_k)|^2 \\
&\leq f(x_k) - .5\alpha |g(x_k)|^2 \\
&< f(x_k) - \varepsilon \alpha |g(x_k)|^2
\end{aligned}$$


since $\varepsilon < .5$. This means that the first part of the stopping criterion is satisfied
for $\alpha < 1/A$.

The second part of the stopping criterion states that $\eta\alpha$ does not satisfy the first
criterion and thus the final $\alpha$ must satisfy $\alpha \geq 1/(\eta A)$. Therefore the inequality of
the first part of the criterion implies


$$f(x_{k+1}) \leq f(x_k) - \frac{\varepsilon}{\eta A} |g(x_k)|^2.$$


Subtracting $f^*$ from both sides,


$$f(x_{k+1}) - f^* \leq f(x_k) - f^* - \frac{\varepsilon}{\eta A} |g(x_k)|^2.$$


Finally, using (39) we obtain


$$f(x_{k+1}) - f^* \leq [1 - (2\varepsilon a/\eta A)] (f(x_k) - f^*).$$


Clearly $2\varepsilon a/\eta A < 1$ and hence there is linear convergence. Notice that if in fact $\varepsilon$ is
chosen very close to .5 and $\eta$ is chosen very close to 1, then the stopping condition
demands that the $\alpha$ be restricted to a very small range, and the estimated rate of
convergence is very close to the estimate obtained above for exact line search.


3. Asymptotic convergence. We expect that as the points generated by steepest 
descent approach the solution point, the convergence characteristics will be close 
to those inherent for quadratic functions. This is indeed the case. 

The general procedure for proving such a result, which is applicable to most 
methods having unity order of convergence, is to use the Hessian of the objective at 
the solution point as if it were the Q matrix of a quadratic problem. The particular 
theorem stated below is a special case of a theorem in Section 12.5 so we do 
not prove it here; but it illustrates the generalizability of an analysis of quadratic 
problems. 


Theorem. Suppose $f$ is defined on $E^n$, has continuous second partial derivatives,
and has a relative minimum at $x^*$. Suppose further that the Hessian matrix
of $f$, $F(x^*)$, has smallest eigenvalue $a > 0$ and largest eigenvalue $A > 0$. If
$\{x_k\}$ is a sequence generated by the method of steepest descent that converges
to $x^*$, then the sequence of objective values $\{f(x_k)\}$ converges to $f(x^*)$ linearly
with a convergence ratio no greater than $[(A - a)/(A + a)]^2$.


8.7 APPLICATIONS OF THE THEORY


Now that the basic convergence theory, as represented by the formula (35) for the 
rate of convergence, has been developed and demonstrated to actually characterize 
the behavior of steepest descent, it is appropriate to illustrate how the theory can 
be used. Generally, we do not suggest that one compute the numerical value of the 
formula—since it involves eigenvalues, or ratios of eigenvalues, that are not easily 
determined. Nevertheless, the formula itself is of immense practical importance, 
since it allows one to theoretically compare various situations. Without such a 
theory, one would be forced to rely completely on experimental comparisons. 


Application 1 (Solution of gradient equation). One approach to the minimization
of a function $f$ is to consider solving the equations $\nabla f(x) = 0$ that represent the
necessary conditions. It has been proposed that these equations could be solved
by applying steepest descent to the function $h(x) = |\nabla f(x)|^2$. One advantage of
this method is that the minimum value is known. We ask whether this method is
likely to be faster or slower than the application of steepest descent to the original
function $f$ itself.

For simplicity we consider only the case where $f$ is quadratic. Thus let $f(x) =
(1/2) x^T Q x - b^T x$. Then the gradient of $f$ is $g(x) = Qx - b$, and $h(x) = |g(x)|^2 =
x^T Q^2 x - 2 x^T Q b + b^T b$. Thus $h(x)$ is itself a quadratic function. The rate of convergence
of steepest descent applied to $h$ will be governed by the eigenvalues of the
matrix $Q^2$. In particular the rate will be


$$\left( \frac{\bar{r} - 1}{\bar{r} + 1} \right)^2,$$


where $\bar{r}$ is the condition number of the matrix $Q^2$. However, the eigenvalues of $Q^2$
are the squares of those of $Q$ itself, so $\bar{r} = r^2$, where $r$ is the condition number of
$Q$, and it is clear that the convergence rate for the proposed method will be worse
than for steepest descent applied to the original function.
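
A small numerical sketch makes the comparison concrete. The diagonal matrix below (condition number 100) is an assumed example, and `rate_bound` is a hypothetical helper that evaluates the ratio $[(r-1)/(r+1)]^2$; the script compares only the two bounds, not actual runs of the method.

```python
import math
import numpy as np


def rate_bound(r):
    """Convergence ratio bound ((r - 1)/(r + 1))^2 for condition number r."""
    return ((r - 1.0) / (r + 1.0)) ** 2


Q = np.diag([1.0, 4.0, 100.0])        # assumed example with r = 100
r = np.linalg.cond(Q)
r_bar = np.linalg.cond(Q @ Q)         # condition number of Q^2, equal to r^2

# Number of steps on h needed to match the error reduction of one step on f,
# judged by the two bounds.
steps_ratio = math.log(rate_bound(r)) / math.log(rate_bound(r_bar))
```

For this matrix `steps_ratio` comes out very close to $r$ itself, which indicates that, as measured by the bounds, roughly $r$ steps on $h$ are needed to match one step on $f$.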


We can go further and actually estimate how much slower the proposed method
is likely to be. If $r$ is large, we have


$$\text{steepest descent rate} = \left( \frac{r - 1}{r + 1} \right)^2 \simeq (1 - 1/r)^4$$

$$\text{proposed method rate} = \left( \frac{r^2 - 1}{r^2 + 1} \right)^2 \simeq (1 - 1/r^2)^4.$$


Since $(1 - 1/r^2)^r \simeq 1 - 1/r$, it follows that it takes about $r$ steps of the new method
to equal one step of ordinary steepest descent. We conclude that if the original
problem is difficult to solve with steepest descent, the proposed method will be
quite a bit worse.


Application 2 (Penalty methods). Let us briefly consider a problem with a single
constraint:


$$\begin{aligned} &\text{minimize} && f(x) \\ &\text{subject to} && h(x) = 0. \end{aligned} \tag{41}$$


One method for approaching this problem is to convert it (at least approximately)
to the unconstrained problem


$$\text{minimize} \quad f(x) + \tfrac{1}{2} \mu h(x)^2, \tag{42}$$


where $\mu$ is a (large) penalty coefficient. Because of the penalty, the solution to (42)
will tend to have a small $h(x)$. Problem (42) can be solved as an unconstrained
problem by the method of steepest descent. How will this behave?

For simplicity let us consider the case where $f$ is quadratic and $h$ is linear.
Specifically, we consider the problem


$$\begin{aligned} &\text{minimize} && \tfrac{1}{2} x^T Q x - b^T x \\ &\text{subject to} && c^T x = 0. \end{aligned} \tag{43}$$


The objective of the associated penalty problem is $(1/2)\{x^T Q x + \mu x^T c c^T x\} - b^T x$.
The quadratic form associated with this objective is defined by the matrix $Q + \mu c c^T$
and, accordingly, the convergence rate of steepest descent will be governed by
the condition number of this matrix. This matrix is the original matrix $Q$ with a
large rank-one matrix added. It should be fairly clear† that this addition will cause
one eigenvalue of the matrix to be large (on the order of $\mu$). Thus the condition
number is roughly proportional to $\mu$. Therefore, as one increases $\mu$ in order to get
an accurate solution to the original constrained problem, the rate of convergence
becomes extremely poor. We conclude that the penalty function method used in this
simplistic way with steepest descent will not be very effective. (Penalty functions,
and how to minimize them more rapidly, are considered in detail in Chapter 11.)
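
The growth of the condition number with the penalty coefficient can be observed directly. In this sketch the matrix $Q$ is the one from the example of Section 8.6, while the constraint vector $c = (1, 1, 1, 1)$ and the list of penalty values are assumptions chosen for illustration.

```python
import numpy as np

Q = np.array([[ 0.78, -0.02, -0.12, -0.14],
              [-0.02,  0.86, -0.04,  0.06],
              [-0.12, -0.04,  0.72, -0.08],
              [-0.14,  0.06, -0.08,  0.74]])
c = np.ones(4)                        # assumed constraint vector, c^T x = 0

mus = [1.0, 10.0, 100.0, 1000.0]
conds = []
for mu in mus:
    eigs = np.linalg.eigvalsh(Q + mu * np.outer(c, c))   # ascending order
    conds.append(eigs[-1] / eigs[0])
    print(f"mu = {mu:7.1f}   largest eigenvalue = {eigs[-1]:10.3f}   "
          f"condition number = {eigs[-1] / eigs[0]:10.3f}")
```

Only the largest eigenvalue moves appreciably with $\mu$; the others stay within the range of the eigenvalues of $Q$, so the condition number grows essentially in proportion to $\mu$.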


†See the Interlocking Eigenvalues Lemma in Section 10.6 for a proof that only one eigenvalue
becomes large.


Scaling


The performance of the method of steepest descent is dependent on the particular
choice of variables $x$ used to define the problem. A new choice may substantially
alter the convergence characteristics.

Suppose that $T$ is an invertible $n \times n$ matrix. We can then represent points in
$E^n$ either by the standard vector $x$ or by $y$ where $Ty = x$. The problem of finding
$x$ to minimize $f(x)$ is equivalent to that of finding $y$ to minimize $h(y) = f(Ty)$.
Using $y$ as the underlying set of variables, we then have


$$\nabla h = \nabla f \, T, \tag{44}$$


where $\nabla f$ is the gradient of $f$ with respect to $x$. Thus, using steepest descent, the
direction of search will be


$$\Delta y = -T^T \nabla f^T, \tag{45}$$


which in the original variables is


$$\Delta x = -T T^T \nabla f^T. \tag{46}$$


Thus we see that the change of variables changes the direction of search. 

The rate of convergence of steepest descent with respect to $y$ will be determined
by the eigenvalues of the Hessian of the objective, taken with respect to $y$. That
Hessian is


$$\nabla^2 h(y) \equiv H(y) = T^T F(Ty) T.$$


Thus, if $x^* = Ty^*$ is the solution point, the rate of convergence is governed by the
matrix


$$H(y^*) = T^T F(x^*) T. \tag{47}$$


Very little can be said in comparison of the convergence ratio associated with
$H$ and that of $F$. If $T$ is an orthonormal matrix, corresponding to $y$ being defined
from $x$ by a simple rotation of coordinates, then $T T^T = I$, and we see from (46) that
the directions remain unchanged and the eigenvalues of $H$ are the same as those
of $F$.

In general, before attacking a problem with steepest descent, it is desirable, 
if it is feasible, to introduce a change of variables that leads to a more favorable 
eigenvalue structure. Usually the only kind of transformation that is at all practical 
is one having T equal to a diagonal matrix, corresponding to the introduction 
of scale factors on each of the variables. One should strive, in doing this, to 
make the second derivatives with respect to each variable roughly the same. 
Although appropriate scaling can potentially lead to substantial payoff in terms of 
enhanced convergence rate, we largely ignore this possibility in our discussions of 
steepest descent. However, see the next application for a situation that frequently 
occurs. 


Application 3. (Program design). In applied work it is extremely rare that one
solves just a single optimization problem of a given type. It is far more usual that
once a problem is coded for computer solution, it will be solved repeatedly for
various parameter values. Thus, for example, if one is seeking to find the optimal
production plan (as in Example 1 of Section 7.2), the problem will be solved for
the different values of the input prices. Similarly, other optimization problems will
be solved under various assumptions and constraint values. It is for this reason
that speed of convergence and convergence analysis are so important. One wants a
program that can be used efficiently. In many such situations, the effort devoted to
proper scaling repays itself, not with the first execution, but in the long run.

As a simple illustration consider the problem of minimizing the function


$$f(x, y) = x^2 - 5xy + y^4 - ax - by.$$


It is desirable to obtain solutions quickly for different values of the parameters $a$
and $b$. We begin with the values $a = 25$, $b = 8$.

The result of steepest descent applied to this problem directly is shown in
Table 8.2, column (a). It requires eighty iterations for convergence, which could be
regarded as disappointing.


Table 8.2 Solution to Scaling Application

                      Value of $f$
Iteration       (a)              (b)
no.             Unscaled         Scaled
    0              0.0000           0.0000
    1           -230.9958        -162.2000
    2           -256.4042        -289.3124
    4           -293.1705        -341.9802
    6           -313.3619        -342.9865
    8           -324.9978        -342.9998
    9           -329.0408        -343.0000
   15           -339.6124
   20           -341.9022
   25           -342.6004
   30           -342.8372
   35           -342.9275
   40           -342.9650
   45           -342.9825
   50           -342.9909
   55           -342.9951
   60           -342.9971
   65           -342.9883
   70           -342.9990
   75           -342.9994
   80           -342.9997

Solution: $x = 20.0$, $y = 3.0$


The reason for this poor performance is revealed by examining the Hessian
matrix


$$F = \begin{bmatrix} 2 & -5 \\ -5 & 12y^2 \end{bmatrix}.$$


Using the results of our first experiment, we know that $y = 3$. Hence the diagonal
elements of the Hessian, at the solution, differ by a factor of 54. (In fact, the
condition number is about 61.) As a simple remedy we scale the problem by
replacing the variable $y$ by $z = ty$. The new lower right-corner term of the Hessian
then becomes $12z^2/t^4$, which has magnitude $12 \times t^2 \times 3^2/t^4 = 108/t^2$. Thus we
might put $t = 7$ in order to make the two diagonal terms approximately equal.
The result of applying steepest descent to the problem scaled this way is shown in
Table 8.2, column (b). (This superior performance is in accordance with our general
theory, since the condition number of the scaled problem is about two.) For other
nearby values of $a$ and $b$, similar speeds will be attained.
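
The two condition numbers quoted above can be checked with a few lines of NumPy. The sketch below evaluates the Hessian at the solution value $y = 3$ and applies the scaling through formula (47); the helper name `hessian` and the matrix `T` are introduced here for illustration.

```python
import numpy as np


def hessian(y):
    """Hessian of f(x, y) = x^2 - 5xy + y^4 - ax - by (independent of a, b)."""
    return np.array([[2.0, -5.0],
                     [-5.0, 12.0 * y ** 2]])


y_star = 3.0
F = hessian(y_star)

t = 7.0
T = np.diag([1.0, 1.0 / t])           # (x, y) = T (x, z), since y = z / t
H = T.T @ F @ T                       # scaled Hessian, as in (47)

cond_unscaled = np.linalg.cond(F)
cond_scaled = np.linalg.cond(H)
```

Because the Hessian does not depend on $a$ or $b$ except through the solution value of $y$, the same scale factor remains effective for nearby parameter values, which is the point of the application.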


8.8 NEWTON’S METHOD 


The idea behind Newton’s method is that the function $f$ being minimized is approximated
locally by a quadratic function, and this approximate function is minimized
exactly. Thus near $x_k$ we can approximate $f$ by the truncated Taylor series


$$f(x) \simeq f(x_k) + \nabla f(x_k)(x - x_k) + \tfrac{1}{2} (x - x_k)^T F(x_k) (x - x_k).$$


The right-hand side is minimized at


$$x_{k+1} = x_k - [F(x_k)]^{-1} \nabla f(x_k)^T, \tag{48}$$


and this equation is the pure form of Newton’s method. 

In view of the second-order sufficiency conditions for a minimum point, we 
assume that at a relative minimum point, x*, the Hessian matrix, F(x*), is positive 
definite. We can then argue that if f has continuous second partial derivatives, 
F(x) is positive definite near x* and hence the method is well defined near the 
solution. 


Order Two Convergence 


Newton’s method has very desirable properties if started sufficiently close to the 
solution point. Its order of convergence is two. 


Theorem. (Newton’s method). Let $f \in C^3$ on $E^n$, and assume that at the
local minimum point $x^*$, the Hessian $F(x^*)$ is positive definite. Then if started
sufficiently close to $x^*$, the points generated by Newton’s method converge to
$x^*$. The order of convergence is at least two.

Proof. There are $\rho > 0$, $\beta_1 > 0$, $\beta_2 > 0$ such that for all $x$ with $|x - x^*| < \rho$, there
holds $|F(x)^{-1}| < \beta_1$ (see Appendix A for the definition of the norm of a matrix)
and $|\nabla f(x^*)^T - \nabla f(x)^T - F(x)(x^* - x)| \leq \beta_2 |x - x^*|^2$. Now suppose $x_k$ is selected
with $\beta_1 \beta_2 |x_k - x^*| < 1$ and $|x_k - x^*| < \rho$. Then


$$\begin{aligned}
|x_{k+1} - x^*| &= |x_k - x^* - F(x_k)^{-1} \nabla f(x_k)^T| \\
&= |F(x_k)^{-1} [\nabla f(x^*)^T - \nabla f(x_k)^T - F(x_k)(x^* - x_k)]| \\
&\leq |F(x_k)^{-1}| \, \beta_2 |x_k - x^*|^2 \\
&\leq \beta_1 \beta_2 |x_k - x^*|^2 < |x_k - x^*|.
\end{aligned}$$


The final inequality shows that the new point is closer to $x^*$ than the old point, and
hence all conditions apply again to $x_{k+1}$. The previous inequality establishes that
convergence is second order. ∎
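
The rapid local convergence can be observed on the function used in Application 3 of Section 8.7, $f(x, y) = x^2 - 5xy + y^4 - 25x - 8y$, whose minimum is at $(20, 3)$. The sketch below applies the pure iteration (48) from an assumed starting point $(22, 4)$, where the Hessian is positive definite; the names `grad`, `hess`, and `errors` are illustrative.

```python
import numpy as np

a, b = 25.0, 8.0


def grad(v):
    """Gradient of f(x, y) = x^2 - 5xy + y^4 - ax - by, as a column vector."""
    x, y = v
    return np.array([2.0 * x - 5.0 * y - a,
                     -5.0 * x + 4.0 * y ** 3 - b])


def hess(v):
    """Hessian of the same function."""
    x, y = v
    return np.array([[2.0, -5.0],
                     [-5.0, 12.0 * y ** 2]])


v_star = np.array([20.0, 3.0])
v = np.array([22.0, 4.0])             # assumed starting point near the solution
errors = [np.linalg.norm(v - v_star)]
for k in range(8):
    v = v - np.linalg.solve(hess(v), grad(v))   # pure Newton step (48)
    errors.append(np.linalg.norm(v - v_star))
```

Printing `errors` shows the characteristic pattern of order-two convergence: once the error is small, each step roughly squares it, so only a handful of iterations are needed to reach machine precision.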


Modifications 


Although Newton’s method is very attractive in terms of its convergence properties 
near the solution, it requires modification before it can be used at points that are 
remote from the solution. The general nature of these modifications is discussed in 
the remainder of this section. 


1. Damping. The first modification is that usually a search parameter $\alpha$ is introduced
so that the method takes the form


$$x_{k+1} = x_k - \alpha_k [F(x_k)]^{-1} \nabla f(x_k)^T,$$


where $\alpha_k$ is selected to minimize $f$. Near the solution we expect, on the basis of how
Newton’s method was derived, that $\alpha_k \simeq 1$. Introducing the parameter for general
points, however, guards against the possibility that the objective might increase
with $\alpha_k = 1$, due to nonquadratic terms in the objective function.


2. Positive definiteness. A basic consideration for Newton’s method can be seen
most clearly by a brief examination of the general class of algorithms


$$x_{k+1} = x_k - \alpha M_k g_k, \tag{49}$$


where $M_k$ is an $n \times n$ matrix, $\alpha$ is a positive search parameter, and $g_k = \nabla f(x_k)^T$. We
note that both steepest descent ($M_k = I$) and Newton’s method ($M_k = [F(x_k)]^{-1}$)
belong to this class. The direction vector $d_k = -M_k g_k$ obtained in this way is a
direction of descent if for small $\alpha$ the value of $f$ decreases as $\alpha$ increases from
zero. For small $\alpha$ we can say


$$f(x_{k+1}) = f(x_k) + \nabla f(x_k)(x_{k+1} - x_k) + O(|x_{k+1} - x_k|^2).$$


Employing (49) this can be written as


$$f(x_{k+1}) = f(x_k) - \alpha g_k^T M_k g_k + O(\alpha^2).$$


As $\alpha \to 0$, the second term on the right dominates the third. Hence if one is to
guarantee a decrease in $f$ for small $\alpha$, we must have $g_k^T M_k g_k > 0$. The simplest
way to ensure this is to require that $M_k$ be positive definite.

The best circumstance is that where $F(x)$ is itself positive definite throughout
the search region. The objective functions of many important optimization problems
have this property, including for example interior-point approaches to linear
programming using the logarithm as a barrier function. Indeed, it can be argued that
convexity is an inherent property of the majority of well-formulated optimization
problems.

Therefore, assume that the Hessian matrix $F(x)$ is positive definite throughout
the search region and that $f$ has continuous third derivatives. At a given $x_k$ define
the symmetric matrix $T = F(x_k)^{-1/2}$. As in Section 8.7 introduce the change of
variable $Ty = x$. Then according to (46) a steepest descent direction with respect
to $y$ is equivalent to a direction with respect to $x$ of $d = -T T^T g(x_k)$, where $g(x_k)$
is the gradient of $f$ with respect to $x$ at $x_k$. Thus, $d = -F^{-1} g(x_k)$. In other words, a
steepest descent direction in $y$ is equivalent to a Newton direction in $x$.

We can turn this relation around to analyze Newton steps in $x$ as equivalent
to gradient steps in $y$. We know that convergence properties in $y$ depend on the
bounds on the Hessian matrix given by (47) as


$$H(y) = T^T F(x) T = F^{-1/2} F(x) F^{-1/2}. \tag{50}$$


Recall that $F = F(x_k)$ which is fixed, whereas $F(x)$ denotes the general Hessian
matrix with respect to $x$ near $x_k$. The product (50) is the identity matrix at $y_k$, but the
rate of convergence of steepest descent in $y$ depends on the bounds of the smallest
and largest eigenvalues of $H(y)$ in a region near $y_k$.

These observations tell us that the damped version of Newton’s method will
converge at a linear rate at least as fast as $c = (1 - a/A)$ where $a$ and $A$ are lower
and upper bounds on the eigenvalues of $F(x_0)^{-1/2} F(x^0) F(x_0)^{-1/2}$, where $x_0$ and $x^0$
are arbitrary points in the local search region. These bounds depend, in turn, on
the bounds of the third-order derivatives of $f$. It is clear, however, by continuity of
$F(x)$ and its derivatives, that the rate becomes very fast near the solution, becoming
superlinear, and in fact, as we know, quadratic.


3. Backtracking. The backtracking method of line search, using a = | as the initial 
guess, is an attractive procedure for use with Newton’s method. Using this method 
the overall progress of Newton’s method divides naturally into two phases: first 
a damping phase where backtracking may require a < 1, and second a quadratic 
phase where a = | satisfies the backtracking criterion at every step. The damping 
phase was discussed above. 

Let us now examine the situation close to the solution. We assume that all
derivatives of $f$ through the third are continuous and uniformly bounded. We also
assume that in the region close to the solution, $F(x)$ is positive definite with $a > 0$
and $A > 0$ being, respectively, uniform lower and upper bounds on the eigenvalues
of $F(x)$. Using $\alpha = 1$ and $\varepsilon < 0.5$ we have for $d_k = -F(x_k)^{-1} g(x_k)$

$$\begin{aligned}
f(x_k + d_k) &= f(x_k) - g(x_k)^T F(x_k)^{-1} g(x_k) + \tfrac{1}{2} g(x_k)^T F(x_k)^{-1} g(x_k) + o(|g(x_k)|^2) \\
&= f(x_k) - \tfrac{1}{2} g(x_k)^T F(x_k)^{-1} g(x_k) + o(|g(x_k)|^2) \\
&< f(x_k) - \varepsilon\, g(x_k)^T F(x_k)^{-1} g(x_k) + o(|g(x_k)|^2),
\end{aligned}$$

where the $o$ bound is uniform for all $x_k$. Since $|g(x_k)| \to 0$ (uniformly) as $x_k \to x^*$, it
follows that once $x_k$ is sufficiently close to $x^*$, then $f(x_k + d_k) < f(x_k) + \varepsilon\, g(x_k)^T d_k$
(note that $g(x_k)^T d_k = -g(x_k)^T F(x_k)^{-1} g(x_k)$), and hence the backtracking test (the
first part of Armijo’s rule) is satisfied. This means that $\alpha = 1$ will be used throughout
the final phase.
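
The two phases can be observed numerically. The following is a minimal Python sketch, not taken from the text: the function name `newton_backtracking`, the shrink factor 0.5, the value 0.25 for the backtracking parameter, and the test function $f(x) = \sum_i \sqrt{1 + x_i^2}$ are all illustrative assumptions.

```python
import numpy as np

def newton_backtracking(f, grad, hess, x0, eps=0.25, shrink=0.5,
                        tol=1e-10, max_iter=50):
    """Newton's method with backtracking from alpha = 1 (illustrative sketch).

    Returns the final point and the list of accepted step lengths.
    """
    x = np.asarray(x0, dtype=float)
    steps = []
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(hess(x), g)          # Newton direction
        alpha = 1.0                               # initial guess alpha = 1
        # Backtrack until f(x + alpha d) <= f(x) + eps * alpha * g'd
        # (the floor on alpha only guards against an endless loop).
        while alpha > 1e-12 and f(x + alpha * d) > f(x) + eps * alpha * (g @ d):
            alpha *= shrink
        steps.append(alpha)
        x = x + alpha * d
    return x, steps

# Hypothetical test problem: f(x) = sum(sqrt(1 + x_i^2)), minimized at x = 0.
f = lambda x: np.sum(np.sqrt(1.0 + x**2))
grad = lambda x: x / np.sqrt(1.0 + x**2)
hess = lambda x: np.diag((1.0 + x**2) ** -1.5)

x_final, steps = newton_backtracking(f, grad, hess, x0=[2.0])
print(steps)   # a damped first step, then full steps
```

Started from $x = 2$, where the pure Newton step overshoots badly, the first step is damped; every later step passes the test with $\alpha = 1$.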


4. General Problems. In practice, Newton’s method must be modified to accommodate
the possible nonpositive definiteness at regions remote from the solution.

A common approach is to take $M_k = [\varepsilon_k I + F(x_k)]^{-1}$ for some non-negative
value of $\varepsilon_k$. This can be regarded as a kind of compromise between steepest descent
($\varepsilon_k$ very large) and Newton’s method ($\varepsilon_k = 0$). There is always an $\varepsilon_k$ that makes
$M_k$ positive definite. We shall present one modification of this type.

Let $F_k = F(x_k)$. Fix a constant $\delta > 0$. Given $x_k$, calculate the eigenvalues of $F_k$
and let $\varepsilon_k$ be the smallest nonnegative constant for which the matrix $\varepsilon_k I + F_k$ has
eigenvalues greater than or equal to $\delta$. Then define

$$d_k = -(\varepsilon_k I + F_k)^{-1} g_k \qquad (51)$$

and iterate according to

$$x_{k+1} = x_k + \alpha_k d_k, \qquad (52)$$

where $\alpha_k$ minimizes $f(x_k + \alpha d_k)$, $\alpha \ge 0$.
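
As a rough illustration of (51), the shift can be read off from the smallest eigenvalue of the Hessian: it is $\max(0, \delta - \lambda_{\min})$. The sketch below assumes NumPy; the helper name `shifted_newton_direction` is hypothetical, and the line search of (52) is left out.

```python
import numpy as np

def shifted_newton_direction(F, g, delta):
    """Return (eps, d) for the eigenvalue-shift rule (illustrative sketch).

    eps is the smallest nonnegative number such that all eigenvalues of
    eps*I + F are >= delta, and d = -(eps*I + F)^{-1} g as in (51).
    """
    lam_min = np.linalg.eigvalsh(F)[0]        # eigvalsh returns ascending eigenvalues
    eps = max(0.0, delta - lam_min)
    d = -np.linalg.solve(eps * np.eye(len(g)) + F, g)
    return eps, d

# Indefinite Hessian: a shift is required, and d is a descent direction.
F = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
eps, d = shifted_newton_direction(F, g, delta=0.5)
print(eps, d, g @ d)
```

When the Hessian already has all eigenvalues above $\delta$, the function returns a zero shift and the pure Newton direction.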

This algorithm has the desired global and local properties. First, since the
eigenvalues of a matrix depend continuously on its elements, $\varepsilon_k$ is a continuous
function of $x_k$ and hence the mapping $D: E^n \to E^{2n}$ defined by $D(x_k) = (x_k, d_k)$
is continuous. Thus the algorithm $A = SD$ is closed at points outside the solution
set $\Omega = \{x : \nabla f(x) = 0\}$. Second, since $\varepsilon_k I + F_k$ is positive definite, $d_k$ is a descent
direction and thus $Z(x) = f(x)$ is a continuous descent function for $A$. Therefore,
assuming the generated sequence is bounded, the Global Convergence Theorem
applies. Furthermore, if $\delta > 0$ is smaller than the smallest eigenvalue of $F(x^*)$,
then for $x_k$ sufficiently close to $x^*$ we will have $\varepsilon_k = 0$, and the method reduces to
Newton’s method. Thus this revised method also has order of convergence equal
to two.

The selection of an appropriate $\delta$ is somewhat of an art. A small $\delta$ means that
nearly singular matrices must be inverted, while a large $\delta$ means that the order two
convergence may be lost. Experimentation and familiarity with a given class of
problems are often required to find the best $\delta$.

The utility of the above algorithm is hampered by the necessity to calculate
the eigenvalues of $F(x_k)$, and in practice an alternate procedure is used. In one
class of methods (Levenberg-Marquardt type methods), for a given value of $\varepsilon_k$,
Cholesky factorization of the form $\varepsilon_k I + F(x_k) = GG^T$ (see Exercise 6 of Chapter 7)
is employed to check for positive definiteness. If the factorization breaks down,
$\varepsilon_k$ is increased. The factorization then also provides the direction vector through
solution of the equations $GG^T d_k = -g_k$, which are easily solved, since $G$ is triangular.
Then the value $f(x_k + d_k)$ is examined. If it is sufficiently below $f(x_k)$, then $x_{k+1}$ is
accepted and a new $\varepsilon_{k+1}$ is determined. Essentially, $\varepsilon$ serves as a search parameter
in these methods. It should be clear from this discussion that the simplicity that
Newton’s method first seemed to promise is not fully realized in practice.
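
The factorization test can be sketched as follows. This is only an illustration under stated assumptions: the helper name `lm_direction`, the growth factor 10, and the starting shift `eps_min` are hypothetical choices, and the acceptance test on $f(x_k + d_k)$ is omitted.

```python
import numpy as np

def lm_direction(F, g, eps=0.0, growth=10.0, eps_min=1e-3):
    """Raise eps until eps*I + F has a Cholesky factor; return (eps, d)."""
    n = len(g)
    while True:
        try:
            # G is lower triangular with G G^T = eps*I + F;
            # NumPy raises LinAlgError if the matrix is not positive definite.
            G = np.linalg.cholesky(eps * np.eye(n) + F)
            break
        except np.linalg.LinAlgError:
            eps = max(growth * eps, eps_min)
    # Solve G G^T d = -g in two stages. np.linalg.solve is a general solver;
    # a dedicated triangular solver would exploit the structure of G.
    y = np.linalg.solve(G, -g)
    d = np.linalg.solve(G.T, y)
    return eps, d

F = np.array([[1.0, 0.0], [0.0, -2.0]])   # indefinite Hessian
g = np.array([1.0, 1.0])
eps, d = lm_direction(F, g)
print(eps, d)
```

No eigenvalues are computed here; the price is that the shift found this way is generally larger than the smallest one that would work.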


Newton’s Method and Logarithms 


Interior point methods of linear and nonlinear programming use barrier functions, 
which usually are based on the logarithm. For linear programming especially, this 
means that the only nonlinear terms are logarithms. Newton’s method enjoys some 
special properties in this case.

To illustrate, let us apply Newton’s method to the one-dimensional problem

$$\min_x \; [tx - \ln x] \qquad (53)$$

where $t$ is a positive parameter. The derivative at $x$ is

$$f'(x) = t - \frac{1}{x},$$

and of course the solution is $x^* = 1/t$, or equivalently $1 - tx^* = 0$. The second
derivative is $f''(x) = 1/x^2$. Denoting by $x^+$ the result of one step of a pure Newton’s
method (with step length equal to 1) applied to the point $x$, we find

$$x^+ = x - [f''(x)]^{-1} f'(x) = x - x^2\left(t - \frac{1}{x}\right) = x - tx^2 + x = 2x - tx^2.$$

Thus

$$1 - tx^+ = 1 - 2tx + t^2x^2 = (1 - tx)^2. \qquad (54)$$

Therefore, rather surprisingly, the quadratic nature of convergence of $(1 - tx) \to 0$
is directly evident and exact. Expression (54) represents a reduction in the error
magnitude only if $|1 - tx| < 1$, or equivalently, $0 < x < 2/t$. If $x$ is too large,
then Newton’s method must be used with damping until the region $0 < x < 2/t$ is
reached. From then on, a step size of 1 will exhibit pure quadratic error reduction.

[Figure: graph of $f'(x) = t - 1/x$, with the points $x_1$, $x_k$, $x_{k+1}$ and $1/t$ marked on the $x$ axis.]

Fig. 8.11 Newton’s method applied to minimization of $tx - \ln x$


The situation is shown in Fig. 8.11. The graph is that of f’(x) = t—1/x. The 
root-finding form of Newton’s method (Section 8.2) is then applied to this function. 
At each point, the tangent line is followed to the x axis to find the new point. 
The starting value marked $x_1$ is far from the solution $1/t$ and hence following the
tangent would lead to a new point that was negative. Damping must be applied at 
that starting point. Once a point x is reached with 0 < x < 1/t, all further points 
will remain to the left of 1/t and move toward it quadratically. 
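
The exact squaring of the error in (54) is easy to check numerically. The short sketch below uses an arbitrary parameter value $t = 4$ and starting point $x = 0.05$ (both illustrative assumptions); the function name `newton_log_step` is hypothetical.

```python
def newton_log_step(x, t):
    """One pure Newton step for f(x) = t*x - ln(x): x+ = 2x - t*x^2."""
    return 2.0 * x - t * x * x

t = 4.0
x = 0.05                      # inside the region 0 < x < 2/t = 0.5
errors = [1.0 - t * x]        # the quantity 1 - t*x from (54)
for _ in range(5):
    x = newton_log_step(x, t)
    errors.append(1.0 - t * x)

for e_prev, e_next in zip(errors, errors[1:]):
    print(e_prev, e_next, e_prev**2)   # e_next agrees with e_prev squared

# Outside the region (x > 2/t) a full step leaves the domain x > 0.
print(newton_log_step(0.6, t))
```

Up to rounding, each printed error equals the square of the one before it, while the last line shows the negative iterate that makes damping necessary for large $x$.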

In interior point methods for linear programming, a logarithmic barrier function 
is applied separately to the variables that must remain positive. The convergence 
analysis in these situations is an extension of that for the simple case given here, 
allowing for estimates of the rate of convergence that do not require knowledge of 
bounds of third-order derivatives. 


Self-Concordant Functions 


The special properties exhibited above for the logarithm have been extended to the 
general class of self-concordant functions of which the logarithm is the primary 
example. A function f defined on the real line is self-concordant if it satisfies 


$$|f'''(x)| \le 2 f''(x)^{3/2} \qquad (55)$$


throughout its domain. It is easily verified that $f(x) = -\ln x$ satisfies this inequality
with equality for $x > 0$.
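
For completeness, the verification for the logarithm takes one line; the derivatives below follow directly from $f(x) = -\ln x$ on $x > 0$.

```latex
f'(x) = -\frac{1}{x}, \qquad f''(x) = \frac{1}{x^{2}}, \qquad f'''(x) = -\frac{2}{x^{3}},
\qquad\text{so}\qquad
|f'''(x)| = \frac{2}{x^{3}} = 2\left(\frac{1}{x^{2}}\right)^{3/2} = 2 f''(x)^{3/2}.
```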

Self-concordancy is preserved by the addition of an affine term since such a 
term does not affect the second or third derivatives. 

A function defined on $E^n$ is said to be self-concordant if it is self-concordant
in every direction: that is, if $f(x + \alpha d)$ is self-concordant with respect to $\alpha$ for every
$d$ throughout the domain of $f$.

Self-concordant functions can be combined by addition and even by composition
with affine functions to yield other self-concordant functions. (See
Exercise 29.) For example the function

$$f(x) = -\sum_{i=1}^{m} \ln(b_i - a_i^T x),$$


often used in interior point methods for linear programming, is self-concordant.
When a self-concordant function is subjected to Newton’s method, the quadratic
convergence of the final phase can be measured in terms of the function

$$\lambda(x) = [\nabla f(x) F(x)^{-1} \nabla f(x)^T]^{1/2},$$

where as usual $F(x)$ is the Hessian matrix of $f$ at $x$. Then it can be shown that close
to the solution

$$2\lambda(x_{k+1}) \le [2\lambda(x_k)]^2. \qquad (56)$$

Furthermore, in a backtracking procedure, estimates of both the stepwise progress
in the damping phase and the point at which the quadratic phase begins can be
expressed in terms of parameters that depend only on the backtracking parameters.
Although this knowledge does not generally influence practice, it is theoretically
quite interesting.
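
A small numerical experiment can make $\lambda(x)$ concrete. The sketch below evaluates it for the logarithmic barrier of the box $-1 \le x_i \le 1$ in two dimensions and records its value over two pure Newton steps. The data, the starting point, and the function names are illustrative assumptions; the text states (56) only "close to the solution", so the starting point was chosen with $\lambda$ already small.

```python
import numpy as np

def barrier_grad_hess(A, b, x):
    """Gradient and Hessian of f(x) = -sum_i ln(b_i - a_i^T x)."""
    s = b - A @ x                          # slacks; must stay positive
    grad = A.T @ (1.0 / s)
    hess = A.T @ (A / s[:, None] ** 2)     # sum_i a_i a_i^T / s_i^2
    return grad, hess

def newton_decrements(A, b, x0, n_steps):
    """Run pure Newton steps (step length 1) and record lambda(x_k)."""
    x = np.asarray(x0, dtype=float)
    lams = []
    for _ in range(n_steps + 1):
        g, F = barrier_grad_hess(A, b, x)
        step = np.linalg.solve(F, g)
        lams.append(np.sqrt(g @ step))     # lambda(x) = [g' F^{-1} g]^{1/2}
        x = x - step
    return lams

# Hypothetical data: the box -1 <= x_i <= 1, written as A x <= b.
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
lams = newton_decrements(A, b, x0=[0.1, -0.05], n_steps=2)
for lam_k, lam_next in zip(lams, lams[1:]):
    print(2 * lam_next, (2 * lam_k) ** 2)  # left value should not exceed right
```

For this problem the minimizer is the center of the box, and the recorded values shrink far faster than (56) requires.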


Example 1. (The logarithmic case). Consider the earlier example of $f(x) = tx - \ln x$. There

$$\lambda(x) = [f'(x)^2 / f''(x)]^{1/2} = [(t - 1/x)^2 x^2]^{1/2} = |1 - tx|.$$

Then (56) gives

$$|1 - tx^+| \le 2(1 - tx)^2.$$

Actually, for this example, as we found in (54), the factor of 2 is not required.


There is a relation between the analysis of self-concordant functions and our 
earlier convergence analysis. 

Recall that one way to analyze Newton’s method is to change variables from
$x$ to $y$ according to $\tilde{x} = [F(x)]^{-1/2}\tilde{y}$, where here $x$ is a reference point and $\tilde{x}$ is
variable. The gradient with respect to $\tilde{y}$ at $\tilde{y}$ is then $F(x)^{-1/2} \nabla f(\tilde{x})^T$, and hence the
norm of the gradient at $y$ is $[\nabla f(x) F(x)^{-1} \nabla f(x)^T]^{1/2} \equiv \lambda(x)$. Hence it is perhaps
not surprising that $\lambda(x)$ plays a role analogous to the role played by the norm of
the gradient in the analysis of steepest descent.


8.9 COORDINATE DESCENT METHODS


The algorithms discussed in this section are sometimes attractive because of their 
easy implementation. Generally, however, their convergence properties are poorer 
than steepest descent. 

Let $f$ be a function on $E^n$ having continuous first partial derivatives. Given
a point $x = (x_1, x_2, \ldots, x_n)$, descent with respect to the coordinate $x_i$ ($i$ fixed)
means that one solves

$$\underset{x_i}{\text{minimize}} \; f(x_1, x_2, \ldots, x_n).$$

Thus only changes in the single component $x_i$ are allowed in seeking a new and
better vector $x$. In our general terminology, each such descent can be regarded as a
descent in the direction $e_i$ (or $-e_i$) where $e_i$ is the $i$th unit vector. By sequentially
minimizing with respect to different components, a relative minimum of $f$ might
ultimately be determined.

There are a number of ways that this concept can be developed into a full
algorithm. The cyclic coordinate descent algorithm minimizes $f$ cyclically with
respect to the coordinate variables. Thus $x_1$ is changed first, then $x_2$ and so forth
through $x_n$. The process is then repeated starting with $x_1$ again. A variation of this is
the Aitken double sweep method. In this procedure one searches over $x_1, x_2, \ldots, x_n$,
in that order, and then comes back in the order $x_{n-1}, x_{n-2}, \ldots, x_1$. These cyclic
methods have the advantage of not requiring any information about $\nabla f$ to determine
the descent directions.

If the gradient of $f$ is available, then it is possible to select the order of descent
coordinates on the basis of the gradient. A popular technique is the Gauss-Southwell
Method where at each stage the coordinate corresponding to the largest (in absolute
value) component of the gradient vector is selected for descent.


Global Convergence 


It is simple to prove global convergence for cyclic coordinate descent. The
algorithmic map $A$ is the composition of $2n$ maps

$$A = S C^n S C^{n-1} \cdots S C^1,$$

where $C^i(x) = (x, e_i)$ with $e_i$ equal to the $i$th unit vector, and $S$ is the usual line
search algorithm but over the doubly infinite line rather than the semi-infinite line.
The map $C^i$ is obviously continuous and $S$ is closed. If we assume that points are
restricted to a compact set, then $A$ is closed by Corollary 1, Section 7.7. We define
the solution set $\Gamma = \{x : \nabla f(x) = 0\}$. If we impose the mild assumption on $f$ that
a search along any coordinate direction yields a unique minimum point, then the
function $Z(x) = f(x)$ serves as a continuous descent function for $A$ with respect
to $\Gamma$. This is because a search along any coordinate direction either must yield a
decrease or, by the uniqueness assumption, it cannot change position. Therefore,
if at a point $x$ we have $\nabla f(x) \ne 0$, then at least one component of $\nabla f(x)$ does
not vanish and a search along the corresponding coordinate direction must yield a
decrease.


Local Convergence Rate 


It is difficult to compare the rates of convergence of these algorithms with the rates 
of others that we analyze. This is partly because coordinate descent algorithms are 
from an entirely different general class of algorithms than, for example, steepest 
descent and Newton’s method, since coordinate descent algorithms are unaffected 
by (diagonal) scale factor changes but are affected by rotation of coordinates—the 
opposite being true for steepest descent. Nevertheless, some comparison is possible. 

It can be shown (see Exercise 20) that for the same quadratic problem as treated
in Section 8.6, there holds for the Gauss-Southwell method

$$E(x_{k+1}) \le \left(1 - \frac{a}{A(n-1)}\right) E(x_k), \qquad (57)$$

where $a$, $A$ are as in Section 8.6 and $n$ is the dimension of the problem. Since

$$\left(\frac{A-a}{A+a}\right)^2 \le \left(1 - \frac{a}{A}\right) \le \left(1 - \frac{a}{A(n-1)}\right)^{n-1}, \qquad (58)$$

we see that the bound we have for steepest descent is better than the bound we have
for $n - 1$ applications of the Gauss-Southwell scheme. Hence we might argue that
it takes essentially $n - 1$ coordinate searches to be as effective as a single gradient
search. This is admittedly a crude guess, since (47) is generally not a tight bound,
but the overall conclusion is consistent with the results of many experiments. Indeed,
unless the variables of a problem are essentially uncoupled from each other (corresponding
to a nearly diagonal Hessian matrix) coordinate descent methods seem
to require about $n$ line searches to equal the effect of one step of steepest descent.
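
The two inequalities in (58) can be justified in a few lines. The derivation below is a sketch that assumes $0 < a \le A$ and $n \ge 2$; the second step uses Bernoulli's inequality, which is not cited in the text.

```latex
% First inequality: since 0 \le (A-a)/(A+a) < 1,
\left(\frac{A-a}{A+a}\right)^{2} \le \frac{A-a}{A+a} \le \frac{A-a}{A} = 1 - \frac{a}{A}.

% Second inequality: Bernoulli's inequality (1+z)^{m} \ge 1 + mz for z \ge -1,
% applied with z = -a/(A(n-1)) and m = n-1:
\left(1 - \frac{a}{A(n-1)}\right)^{n-1} \ge 1 - (n-1)\,\frac{a}{A(n-1)} = 1 - \frac{a}{A}.
```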

The above discussion again illustrates the general objective that we seek in 
convergence analysis. By comparing the formula giving the rate of convergence 
for steepest descent with a bound for coordinate descent, we are able to draw 
some general conclusions on the relative performance of the two methods that are 
not dependent on specific values of a and A. Our analyses of local convergence 
properties, which usually involve specific formulae, are always guided by this 
objective of obtaining general qualitative comparisons. 


Example. The quadratic problem considered in Section 8.6 with

$$Q = \begin{bmatrix}
 0.78 & -0.02 & -0.12 & -0.14 \\
-0.02 &  0.86 & -0.04 &  0.06 \\
-0.12 & -0.04 &  0.72 & -0.08 \\
-0.14 &  0.06 & -0.08 &  0.74
\end{bmatrix}, \qquad b = (0.76, 0.08, 1.12, 0.68)$$

was solved by the various coordinate search methods. The corresponding values of
the objective function are shown in Table 8.3. Observe that the convergence rates
of the three coordinate search methods are approximately equal but that they all
converge about three times slower than steepest descent. This is in accord with the
estimate given above for the Gauss-Southwell method, since in this case $n - 1 = 3$.


Table 8.3 Solutions to Example

                     Value of f for various methods
Iteration no.    Gauss-Southwell    Cyclic         Double sweep
      0             0.0              0.0             0.0
      1            -0.871111        -0.370256       -0.370256
      2            -1.445584        -0.376011       -0.376011
      3            -2.087054        -1.446460       -1.446460
      4            -2.130796        -2.052949       -2.052949
      5            -2.163586        -2.149690       -2.060234
      6            -2.170272        -2.149693       -2.060237
      7            -2.172786        -2.167983       -2.165641
      8            -2.174279        -2.173169       -2.165704
      9            -2.174583        -2.174392       -2.168440
     10            -2.174638        -2.174397       -2.173981
     11            -2.174651        -2.174582       -2.174048
     12            -2.174655        -2.174643       -2.174054
     13            -2.174658        -2.174656       -2.174608
     14            -2.174659        -2.174656       -2.174608
     15            -2.174659        -2.174658       -2.174622
     16                             -2.174659       -2.174655
     17                             -2.174659       -2.174656
     18                                             -2.174656
     19                                             -2.174659
     20                                             -2.174659
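
The three coordinate search rules can be sketched in a few lines of Python for this quadratic. The code below is an illustration rather than the procedure used to produce the table: it assumes the objective $f(x) = \frac{1}{2}x^TQx - b^Tx$, the starting point $x = 0$, and exact coordinate minimization; the function name `coordinate_descent` and the rule labels are hypothetical.

```python
import numpy as np

Q = np.array([[ 0.78, -0.02, -0.12, -0.14],
              [-0.02,  0.86, -0.04,  0.06],
              [-0.12, -0.04,  0.72, -0.08],
              [-0.14,  0.06, -0.08,  0.74]])
b = np.array([0.76, 0.08, 1.12, 0.68])

def f(x):
    return 0.5 * x @ Q @ x - b @ x

def coordinate_descent(rule, n_iter):
    """Exact coordinate searches on f from x = 0; returns objective values."""
    n = len(b)
    if rule == "cyclic":
        order = list(range(n))                                # x1, ..., xn, repeat
    elif rule == "double_sweep":
        order = list(range(n)) + list(range(n - 2, 0, -1))    # forward, then back
    elif rule == "gauss_southwell":
        order = None                                          # chosen from the gradient
    else:
        raise ValueError("unknown rule: " + rule)
    x = np.zeros(n)
    values = [f(x)]
    for k in range(n_iter):
        g = Q @ x - b                                         # gradient of f
        if order is None:
            i = int(np.argmax(np.abs(g)))                     # largest |gradient component|
        else:
            i = order[k % len(order)]
        x[i] -= g[i] / Q[i, i]                                # exact minimizer along e_i
        values.append(f(x))
    return values

for rule in ("gauss_southwell", "cyclic", "double_sweep"):
    print(rule, [round(v, 6) for v in coordinate_descent(rule, 4)])
```

The first step of each rule can be checked by hand: the decrease from an exact search along coordinate $i$ is $g_i^2 / (2 q_{ii})$, which gives $1.12^2 / 1.44 \approx 0.871111$ for Gauss-Southwell and $0.76^2 / 1.56 \approx 0.370256$ for the cyclic rules.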


8.10 SPACER STEPS 


In some of the more complex algorithms presented in later chapters, the rule used to 
determine a succeeding point in an iteration may depend on several previous points 
rather than just the current point, or it may depend on the iteration index k. Such 
features are generally introduced in order to obtain a rapid rate of convergence but 
they can grossly complicate the analysis of global convergence. 

If in such a complex sequence of steps there is inserted, perhaps irregularly 
but infinitely often, a step of an algorithm such as steepest descent that is known to 
converge, then it is not difficult to insure that the entire complex process converges. 
The step which is repeated infinitely often and guarantees convergence is called a 
spacer step, since it separates disjoint portions of the complex sequence. Essentially 
the only requirement imposed on the other steps of the process is that they do not
increase the value of the descent function.

This type of situation can be analyzed easily from the following viewpoint.
Suppose $B$ is an algorithm which together with the descent function $Z$ and solution
set $\Gamma$, satisfies all the requirements of the Global Convergence Theorem. Define
the algorithm $C$ by $C(x) = \{y : Z(y) \le Z(x)\}$. In other words, $C$ applied to $x$ can
give any point so long as it does not increase the value of $Z$. It is easy to verify
that $C$ is closed. We imagine that $B$ represents the spacer step and the complex
process between spacer steps is just some realization of $C$. Thus the overall process
amounts merely to repeated applications of the composite algorithm $CB$. With this
viewpoint we may state the Spacer Step Theorem.


Spacer Step Theorem. Suppose $B$ is an algorithm on $X$ which is closed outside
the solution set $\Gamma$. Let $Z$ be a descent function corresponding to $B$ and $\Gamma$.
Suppose that the sequence $\{x_k\}$ is generated satisfying

$$x_{k+1} \in B(x_k)$$

for $k$ in an infinite index set $\mathcal{K}$, and that

$$Z(x_{k+1}) \le Z(x_k)$$

for all $k$. Suppose also that the set $S = \{x : Z(x) \le Z(x_0)\}$ is compact. Then the
limit of any convergent subsequence of $\{x_k\}_{k \in \mathcal{K}}$ is a solution.


Proof. We first define for any $x \in X$, $\bar{B}(x) = S \cap B(x)$ and then observe that
$A = C\bar{B}$ is closed outside the solution set by Corollary 1, in the subsection on closed
mappings in Section 7.7. The Global Convergence Theorem can then be applied to
$A$. Since $S$ is compact, there is a subsequence of $\{x_k\}_{k \in \mathcal{K}}$ converging to a limit $x$.
In view of the above we conclude that $x \in \Gamma$. ∎
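
The idea can be illustrated on a small quadratic. In the sketch below the "complex" steps are random trial points that are kept only when they do not increase $f$ (a realization of $C$), and every fifth step is a steepest descent step with exact line search (the spacer step $B$). The function name, the spacing of five, the trial step size, and the test problem are all illustrative assumptions.

```python
import numpy as np

def minimize_with_spacer(Q, b, x0, n_iter=200, spacer_every=5, seed=0):
    """Minimize 0.5 x'Qx - b'x with arbitrary non-increasing steps plus spacer steps."""
    rng = np.random.default_rng(seed)
    f = lambda x: 0.5 * x @ Q @ x - b @ x
    x = np.asarray(x0, dtype=float)
    for k in range(n_iter):
        if k % spacer_every == 0:
            # Spacer step B: steepest descent with exact line search.
            g = Q @ x - b
            gg = g @ g
            if gg > 0:
                x = x - (gg / (g @ Q @ g)) * g
        else:
            # Step of type C: any trial point, kept only if f does not increase.
            trial = x + 0.1 * rng.standard_normal(x.shape)
            if f(trial) <= f(x):
                x = trial
    return x

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = minimize_with_spacer(Q, b, x0=[5.0, -3.0])
print(x, np.linalg.norm(Q @ x - b))   # gradient norm after the run
```

The random steps contribute nothing reliable on their own; convergence comes from the spacer steps, and the only demand on the other steps is that the objective never goes up.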


8.11 SUMMARY 


Most iterative algorithms for minimization require a line search at every stage of the 
process. By employing any one of a variety of curve fitting techniques, however, 
the order of convergence of the line search process can be made greater than unity, 
which means that as compared to the linear convergence that accompanies most 
full descent algorithms (such as steepest descent) the individual line searches are 
rapid. Indeed, in common practice, only about three search points are required in 
any one line search. 

It was shown in Sections 8.4, 8.5 and the exercises that line search algorithms 
of varying degrees of accuracy are all closed. Thus line searching is not only rapid 
enough to be practical but also behaves in such a way as to make analysis of global 
convergence simple. 

The most important result of this chapter is the fact that the method of steepest 
descent converges linearly with a convergence ratio equal to $[(A-a)/(A+a)]^2$,
where $a$ and $A$ are, respectively, the smallest and largest eigenvalues of the Hessian
of the objective function evaluated at the solution point. This formula, which arises 
frequently throughout the remainder of the book, serves as a fundamental reference 
point for other algorithms. It is, however, important to understand that it is the 
formula and not its value that serves as the reference. We rarely advocate that 
the formula be evaluated since it involves quantities (namely eigenvalues) that are 
generally not computable until after the optimal solution is known. The formula 
itself, however, even though its value is unknown, can be used to make significant 
comparisons of the effectiveness of steepest descent versus other algorithms. 

Newton’s method has order two convergence. However, it must be modified 
to insure global convergence, and evaluation of the Hessian at every point can be 
costly. Nevertheless, Newton’s method provides another valuable reference point 
in the study of algorithms, and is frequently employed in interior point methods 
using a logarithmic barrier function. 

Coordinate descent algorithms are valuable only in the special situation where 
the variables are essentially uncoupled or there is special structure that makes 
searching in the coordinate directions particularly easy. Otherwise steepest descent 
can be expected to be faster. Even if the gradient is not directly available, it would 
probably be better to evaluate a finite-difference approximation to the gradient, 
by taking a single step in each coordinate direction, and use this approximation 
in a steepest descent algorithm, rather than executing a full line search in each 
coordinate direction. 

Finally, Section 8.10 explains that global convergence is guaranteed simply by 
the inclusion, in a complex algorithm, of spacer steps. This result is called upon 
frequently in what follows. 


8.12 EXERCISES 


1. Show that g[a, b, c] defined by (14) is symmetric, that is, interchange of the arguments 
does not affect its value. 


2. Prove (14) and (15). 
Hint: To prove (15) expand it, and subtract and add $g'(x_k)$ to the numerator.


3. Argue using symmetry that the error in the cubic fit method approximately satisfies an
equation of the form

$$\varepsilon_{k+1} = M(\varepsilon_k^2 \varepsilon_{k-1} + \varepsilon_k \varepsilon_{k-1}^2)$$

and then find the order of convergence.


4. What conditions on the values and derivatives at two points guarantee that a cubic 
polynomial fit to this data will have a minimum between the two points? Use your 
answer to develop a search scheme, based on cubic fit, that is globally convergent for 
unimodal functions. 


5. Using a symmetry argument, find the order of convergence for a line search method
that fits a cubic to $x_{k-3}$, $x_{k-2}$, $x_{k-1}$, $x_k$ in order to find $x_{k+1}$.


6. Consider the iterative process

$$x_{k+1} = \frac{1}{2}\left(x_k + \frac{a}{x_k}\right),$$

where $a > 0$. Assuming the process converges, to what does it converge? What is the
order of convergence?


7. Suppose the continuous real-valued function $f$ of a single variable satisfies

$$\min_{x \ge 0} f(x) < f(0).$$

Starting at any $x > 0$ show that, through a series of halvings and doublings of $x$ and
evaluation of the corresponding $f(x)$’s, a three-point pattern can be determined.


8. For $\delta > 0$ define the map $S^\delta$ by

$$S^\delta(x, d) = \{y : y = x + \alpha d,\ 0 \le \alpha \le \delta;\ f(y) = \min_{0 \le \beta \le \delta} f(x + \beta d)\}.$$

Thus $S^\delta$ searches the interval $[0, \delta]$ for a minimum of $f(x + \alpha d)$, representing a “limited
range” line search. Show that if $f$ is continuous, $S^\delta$ is closed at all $(x, d)$.


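9. For $\varepsilon > 0$ define the map ${}^{\varepsilon}S$ by

$${}^{\varepsilon}S(x, d) = \{y : y = x + \alpha d,\ \alpha \ge 0,\ f(y) \le \min_{0 \le \beta} f(x + \beta d) + \varepsilon\}.$$

Show that if $f$ is continuous, ${}^{\varepsilon}S$ is closed at $(x, d)$ if $d \ne 0$. This map corresponds to
an “inaccurate” line search.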


10. Referring to the previous two exercises, define and prove a result for ${}^{\varepsilon}S^{\delta}$.


11. Define $S$ as the line search algorithm that finds the first relative minimum of $f(x + \alpha d)$
for $\alpha \ge 0$. If $f$ is continuous and $d \ne 0$, is $S$ closed?


12. Consider the problem

$$\text{minimize} \quad 5x^2 + 5y^2 - xy - 11x + 11y + 11.$$

a) Find a point satisfying the first-order necessary conditions for a solution.

b) Show that this point is a global minimum.

c) What would be the rate of convergence of steepest descent for this problem?

d) Starting at $x = y = 0$, how many steepest descent iterations would it take (at most)
to reduce the function value to $10^{-11}$?


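13. Define the search mapping $F$ that determines the parameter $\alpha$ to within a given fraction
$c$, $0 \le c \le 1$, by

$$F(x, d) = \left\{y : y = x + \alpha d,\ 0 \le \alpha < \infty,\ |\alpha - \bar{\alpha}| \le c\bar{\alpha},\ \text{where } \frac{d}{d\alpha} f(x + \bar{\alpha} d) = 0\right\}.$$

Show that if $d \ne 0$ and $(d/d\alpha) f(x + \alpha d)$ is continuous, then $F$ is closed at $(x, d)$.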


14. Let $e_1, e_2, \ldots, e_n$ denote the eigenvectors of the symmetric positive definite $n \times n$
matrix $Q$. For the quadratic problem considered in Section 8.6, suppose $x_0$ is chosen
so that $g_0$ belongs to a subspace $M$ spanned by a subset of the $e_i$’s. Show that for the
method of steepest descent $g_k \in M$ for all $k$. Find the rate of convergence in this case.


15. Suppose we use the method of steepest descent to minimize the quadratic function
$f(x) = \frac{1}{2}(x - x^*)^T Q (x - x^*)$ but we allow a tolerance $\pm\delta\bar{\alpha}_k$ ($\delta \ge 0$) in the line search,
that is

$$x_{k+1} = x_k - \alpha_k g_k,$$

where

$$(1 - \delta)\bar{\alpha}_k \le \alpha_k \le (1 + \delta)\bar{\alpha}_k$$

and $\bar{\alpha}_k$ minimizes $f(x_k - \alpha g_k)$ over $\alpha$.

a) Find the convergence rate of the algorithm in terms of $a$ and $A$, the smallest and
largest eigenvalues of $Q$, and the tolerance $\delta$.
Hint: Assume the extreme case $\alpha_k = (1 + \delta)\bar{\alpha}_k$.

b) What is the largest $\delta$ that guarantees convergence of the algorithm? Explain this
result geometrically.

c) Does the sign of $\delta$ make any difference?


16. Show that for a quadratic objective function the percentage test and the Goldstein test
are equivalent.


17. Suppose in the method of steepest descent for the quadratic problem, the value of $\alpha_k$ is
not determined to minimize $E(x_{k+1})$ exactly but instead only satisfies

$$\frac{E(x_k) - E(x_{k+1})}{E(x_k)} \ge \beta\, \frac{E(x_k) - \bar{E}}{E(x_k)}$$

for some $\beta$, $0 < \beta < 1$, where $\bar{E}$ is the value that corresponds to the best $\alpha_k$. Find the
best estimate for the rate of convergence in this case.


18. Suppose an iterative algorithm of the form

$$x_{k+1} = x_k + \alpha_k d_k$$

is applied to the quadratic problem with matrix $Q$, where $\alpha_k$ as usual is chosen as
the minimum point of the line search and where $d_k$ is a vector satisfying $d_k^T g_k < 0$
and $(d_k^T g_k)^2 \ge \beta (d_k^T Q d_k)(g_k^T Q^{-1} g_k)$, where $0 < \beta \le 1$. This corresponds to a steepest
descent algorithm with “sloppy” choice of direction. Estimate the rate of convergence
of this algorithm.


19. Repeat Exercise 18 with the condition on $(d_k^T g_k)^2$ replaced by

$$(d_k^T g_k)^2 \ge \beta (d_k^T d_k)(g_k^T g_k), \qquad 0 < \beta \le 1.$$


20. Use the result of Exercise 19 to derive (57) for the Gauss-Southwell method.


21. Let $f(x, y) = x^2 + y^2 + xy - 3x$.

a) Find an unconstrained local minimum point of $f$.

b) Why is the solution to (a) actually a global minimum point?

c) Find the minimum point of $f$ subject to $x \ge 0$, $y \ge 0$.

d) If the method of steepest descent were applied to (a), what would be the rate of
convergence of the objective function?


22. Find an estimate for the rate of convergence for the modified Newton method

$$x_{k+1} = x_k - \alpha_k (\varepsilon_k I + F_k)^{-1} g_k$$

given by (51) and (52) when $\delta$ is larger than the smallest eigenvalue of $F(x^*)$.


23. Prove global convergence of the Gauss-Southwell method.


24. Consider a problem of the form

$$\begin{aligned} &\text{minimize} && f(x) \\ &\text{subject to} && x \ge 0, \end{aligned}$$

where $x \in E^n$. A gradient-type procedure has been suggested for this kind of problem
that accounts for the constraint. At a given point $x = (x_1, x_2, \ldots, x_n)$, the direction
$d = (d_1, d_2, \ldots, d_n)$ is determined from the gradient $\nabla f(x)^T = g = (g_1, g_2, \ldots, g_n)$ by

$$d_i = \begin{cases} -g_i & \text{if } x_i > 0 \text{ or } g_i < 0 \\ 0 & \text{if } x_i = 0 \text{ and } g_i \ge 0. \end{cases}$$

This direction is then used as a direction of search in the usual manner.

a) What are the first-order necessary conditions for a minimum point of this problem?

b) Show that $d$, as determined by the algorithm, is zero only at a point satisfying the
first-order conditions.

c) Show that if $d \ne 0$, it is possible to decrease the value of $f$ by movement along $d$.

d) If restricted to a compact region, does the Global Convergence Theorem apply? Why?


25. Consider the quadratic problem and suppose $Q$ has unity diagonal. Consider a coordinate
descent procedure in which the coordinate to be searched is at every stage selected
randomly, each coordinate being equally likely. Let $\varepsilon_k = x_k - x^*$. Assuming $\varepsilon_k$ is known,
show that $\overline{\varepsilon_{k+1}^T Q \varepsilon_{k+1}}$, the expected value of $\varepsilon_{k+1}^T Q \varepsilon_{k+1}$, satisfies

$$\overline{\varepsilon_{k+1}^T Q \varepsilon_{k+1}} = \left(1 - \frac{\varepsilon_k^T Q^2 \varepsilon_k}{n\, \varepsilon_k^T Q \varepsilon_k}\right) \varepsilon_k^T Q \varepsilon_k \le \left(1 - \frac{a}{nA}\right) \varepsilon_k^T Q \varepsilon_k.$$


26. If the matrix $Q$ has a condition number of 10, how many iterations of steepest descent
would be required to get six place accuracy in the minimum value of the objective
function of the corresponding quadratic problem?


27. Stopping criterion. A question that arises in using an algorithm such as steepest descent
to minimize an objective function $f$ is when to stop the iterative process, or, in other
words, how can one tell when the current point is close to a solution. If, as with steepest
descent, it is known that convergence is linear, this knowledge can be used to develop a
stopping criterion. Let $\{f_k\}_{k=0}^{\infty}$ be the sequence of values obtained by the algorithm. We
assume that $f_k \to f^*$ linearly, but both $f^*$ and the convergence ratio $\beta$ are unknown.
However we know that, at least approximately,

$$f_{k+1} - f^* = \beta(f_k - f^*)$$

and

$$f_k - f^* = \beta(f_{k-1} - f^*).$$

These two equations can be solved for $\beta$ and $f^*$.

a) Show that

$$f^* = \frac{f_k^2 - f_{k-1} f_{k+1}}{2f_k - f_{k-1} - f_{k+1}}, \qquad \beta = \frac{f_{k+1} - f_k}{f_k - f_{k-1}}.$$

b) Motivated by the above we form the sequence $\{f_k^*\}$ defined by

$$f_k^* = \frac{f_k^2 - f_{k-1} f_{k+1}}{2f_k - f_{k-1} - f_{k+1}}$$

as the original sequence is generated. (This procedure of generating $\{f_k^*\}$ from $\{f_k\}$ is
called the Aitken $\delta^2$-process.) If $|f_k - f^*| = \beta^k + o(\beta^k)$ show that $|f_k^* - f^*| = o(\beta^k)$
which means that $\{f_k^*\}$ converges to $f^*$ faster than $\{f_k\}$ does. The iterative search
for the minimum of $f$ can then be terminated when $f_k - f_k^*$ is smaller than some
prescribed tolerance.


28. Show that the concordant requirement (55) can be expressed as

$$\left|\frac{d}{dx} f''(x)^{-1/2}\right| \le 1.$$


29. Assume $f(x)$ and $g(x)$ are self-concordant. Show that the following functions are also
self-concordant.

(a) $af(x)$ for $a > 1$
(b) $ax + b + f(x)$
(c) $f(ax + b)$
(d) $f(x) + g(x)$


Chapter 9 CONJUGATE DIRECTION METHODS


Conjugate direction methods can be regarded as being somewhat intermediate 
between the method of steepest descent and Newton’s method. They are motivated 
by the desire to accelerate the typically slow convergence associated with steepest 
descent while avoiding the information requirements associated with the evaluation, 
storage, and inversion of the Hessian (or at least solution of a corresponding system 
of equations) as required by Newton’s method. 

Conjugate direction methods invariably are invented and analyzed for the purely 
quadratic problem 


$$\text{minimize} \quad \tfrac{1}{2} x^T Q x - b^T x,$$


where Q is an n x n symmetric positive definite matrix. The techniques once worked 
out for this problem are then extended, by approximation, to more general problems; 
it being argued that, since near the solution point every problem is approximately 
quadratic, convergence behavior is similar to that for the pure quadratic situation. 

The area of conjugate direction algorithms has been one of great creativity 
in the nonlinear programming field, illustrating that detailed analysis of the pure 
quadratic problem can lead to significant practical advances. Indeed, conjugate 
direction methods, especially the method of conjugate gradients, have proved to be 
extremely effective in dealing with general objective functions and are considered 
among the best general purpose methods. 


9.1 CONJUGATE DIRECTIONS 


Definition. Given a symmetric matrix $Q$, two vectors $d_1$ and $d_2$ are said to
be Q-orthogonal, or conjugate with respect to $Q$, if $d_1^T Q d_2 = 0$.


In the applications that we consider, the matrix $Q$ will be positive definite but this
is not inherent in the basic definition. Thus if $Q = 0$, any two vectors are conjugate,
while if $Q = I$, conjugacy is equivalent to the usual notion of orthogonality. A finite
set of vectors $d_0, d_1, \ldots, d_k$ is said to be a Q-orthogonal set if $d_i^T Q d_j = 0$ for all
$i \ne j$.

Proposition. If $Q$ is positive definite and the set of nonzero vectors $d_0, d_1,
d_2, \ldots, d_k$ are Q-orthogonal, then these vectors are linearly independent.


Proof. Suppose there are constants $\alpha_i$, $i = 0, 1, 2, \ldots, k$ such that

$$\alpha_0 d_0 + \cdots + \alpha_k d_k = 0.$$

Multiplying by $Q$ and taking the scalar product with $d_i$ yields

$$\alpha_i d_i^T Q d_i = 0.$$

Or, since $d_i^T Q d_i > 0$ in view of the positive definiteness of $Q$, we have $\alpha_i = 0$. ∎


Before discussing the general conjugate direction algorithm, let us investigate 
just why the notion of Q-orthogonality is useful in the solution of the quadratic 
problem 


$$\text{minimize} \quad \tfrac{1}{2} x^T Q x - b^T x, \qquad (1)$$


when Q is positive definite. Recall that the unique solution to this problem is also 
the unique solution to the linear equation 


Qx =b, (2) 
and hence that the quadratic minimization problem is equivalent to a linear equation 
problem. 

Corresponding to the n xn positive definite matrix Q let dp, d,,...,d,_, 


be n nonzero Q-orthogonal vectors. By the above proposition they are linearly 
independent, which implies that the solution x* of (1) or (2) can be expanded in 
terms of them as 


x= Ady ae Gate ad} (3) 


for some set of a,’s. In fact, multiplying by Q and then taking the scalar product 
with d, yields directly 
d} Qx* db 
a.= = . 
'  d?Qd, = d?Qd, 


(4) 


This shows that the $\alpha_i$'s and consequently the solution $\mathbf{x}^*$ can be found by evaluation
of simple scalar products. The end result is

$$\mathbf{x}^* = \sum_{i=0}^{n-1} \frac{\mathbf{d}_i^T \mathbf{b}}{\mathbf{d}_i^T \mathbf{Q} \mathbf{d}_i} \, \mathbf{d}_i. \tag{5}$$

There are two basic ideas imbedded in (5). The first is the idea of selecting
an orthogonal set of $\mathbf{d}_i$'s so that by taking an appropriate scalar product, all terms
on the right side of (3), except the $i$th, vanish. This could, of course, have been
accomplished by making the $\mathbf{d}_i$'s orthogonal in the ordinary sense instead of making
them Q-orthogonal. The second basic observation, however, is that by using
Q-orthogonality the resulting equation for $\alpha_i$ can be expressed in terms of the known
vector $\mathbf{b}$ rather than the unknown vector $\mathbf{x}^*$; hence the coefficients can be evaluated
without knowing $\mathbf{x}^*$.

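The expansion for $\mathbf{x}^*$ can be considered to be the result of an iterative process
of $n$ steps where at the $i$th step $\alpha_i \mathbf{d}_i$ is added. Viewing the procedure this way, and
allowing for an arbitrary initial point for the iteration, the basic conjugate direction
method is obtained.

Conjugate Direction Theorem. Let $\{\mathbf{d}_i\}_{i=0}^{n-1}$ be a set of nonzero Q-orthogonal
vectors. For any $\mathbf{x}_0 \in E^n$ the sequence $\{\mathbf{x}_k\}$ generated according to

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k, \quad k \geq 0 \tag{6}$$

with

$$\alpha_k = -\frac{\mathbf{g}_k^T \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{Q} \mathbf{d}_k} \tag{7}$$

and

$$\mathbf{g}_k = \mathbf{Q}\mathbf{x}_k - \mathbf{b},$$

converges to the unique solution, $\mathbf{x}^*$, of $\mathbf{Q}\mathbf{x} = \mathbf{b}$ after $n$ steps, that is, $\mathbf{x}_n = \mathbf{x}^*$.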


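Proof. Since the $\mathbf{d}_k$'s are linearly independent, we can write

$$\mathbf{x}^* - \mathbf{x}_0 = \alpha_0 \mathbf{d}_0 + \alpha_1 \mathbf{d}_1 + \cdots + \alpha_{n-1} \mathbf{d}_{n-1}$$

for some set of $\alpha_k$'s. As we did to get (4), we multiply by $\mathbf{Q}$ and take the scalar
product with $\mathbf{d}_k$ to find

$$\alpha_k = \frac{\mathbf{d}_k^T \mathbf{Q} (\mathbf{x}^* - \mathbf{x}_0)}{\mathbf{d}_k^T \mathbf{Q} \mathbf{d}_k}. \tag{8}$$

Now following the iterative process (6) from $\mathbf{x}_0$ up to $\mathbf{x}_k$ gives

$$\mathbf{x}_k - \mathbf{x}_0 = \alpha_0 \mathbf{d}_0 + \alpha_1 \mathbf{d}_1 + \cdots + \alpha_{k-1} \mathbf{d}_{k-1}, \tag{9}$$

and hence by the Q-orthogonality of the $\mathbf{d}_k$'s it follows that

$$\mathbf{d}_k^T \mathbf{Q} (\mathbf{x}_k - \mathbf{x}_0) = 0. \tag{10}$$

Substituting (10) into (8) produces

$$\alpha_k = \frac{\mathbf{d}_k^T \mathbf{Q} (\mathbf{x}^* - \mathbf{x}_k)}{\mathbf{d}_k^T \mathbf{Q} \mathbf{d}_k} = -\frac{\mathbf{g}_k^T \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{Q} \mathbf{d}_k},$$

which is identical with (7). ∎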


To this point the conjugate direction method has been derived essentially 
through the observation that solving (1) is equivalent to solving (2). The conjugate 
direction method has been viewed simply as a somewhat special, but nevertheless 
straightforward, orthogonal expansion for the solution to (2). This viewpoint, 
although important because of its underlying simplicity, ignores some of the most 
important aspects of the algorithm; especially those aspects that are important when 
extending the method to nonquadratic problems. These additional properties are 
discussed in the next section. 

Also, methods for selecting or generating sequences of conjugate directions 
have not yet been presented. Some methods for doing this are discussed in the 
exercises; while the most important method, that of conjugate gradients, is discussed 
in Section 9.3. 
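
The iteration (6)-(7) of the Conjugate Direction Theorem can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a recommended implementation: the helper `q_orthogonalize` builds a Q-orthogonal set by Gram-Schmidt in the inner product $\mathbf{u}^T\mathbf{Q}\mathbf{v}$ (one possible construction, chosen here only for illustration), and the function names, the 3 x 3 matrix, and the starting point are all hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def q_orthogonalize(Q, vectors):
    """Gram-Schmidt in the Q inner product: returns Q-orthogonal directions."""
    dirs = []
    for p in vectors:
        d = list(p)
        for e in dirs:
            Qe = matvec(Q, e)
            c = dot(p, Qe) / dot(e, Qe)
            d = [di - c * ei for di, ei in zip(d, e)]
        dirs.append(d)
    return dirs

def conjugate_direction_method(Q, b, x0, dirs):
    """Iteration (6)-(7): one exact step along each Q-orthogonal direction."""
    x = list(x0)
    for d in dirs:
        g = [qx - bi for qx, bi in zip(matvec(Q, x), b)]  # g_k = Q x_k - b
        alpha = -dot(g, d) / dot(d, matvec(Q, d))         # step length (7)
        x = [xi + alpha * di for xi, di in zip(x, d)]     # update (6)
    return x

# Hypothetical test problem: a symmetric positive definite 3 x 3 matrix.
Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
unit_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

dirs = q_orthogonalize(Q, unit_vectors)
x_n = conjugate_direction_method(Q, b, [5.0, -3.0, 2.0], dirs)
residual = [qx - bi for qx, bi in zip(matvec(Q, x_n), b)]
print(max(abs(r) for r in residual))  # near zero: x_n solves Qx = b after n steps
```

Any other nonzero Q-orthogonal set could be passed as `dirs`; the theorem does not depend on how the directions were obtained.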


9.2 DESCENT PROPERTIES OF THE CONJUGATE 
DIRECTION METHOD 


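We define $\mathcal{B}_k$ as the subspace of $E^n$ spanned by $\{\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_{k-1}\}$. We shall
show that as the method of conjugate directions progresses each $\mathbf{x}_k$ minimizes the
objective over the $k$-dimensional linear variety $\mathbf{x}_0 + \mathcal{B}_k$.

Expanding Subspace Theorem. Let $\{\mathbf{d}_i\}_{i=0}^{n-1}$ be a sequence of nonzero
Q-orthogonal vectors in $E^n$. Then for any $\mathbf{x}_0 \in E^n$ the sequence $\{\mathbf{x}_k\}$ generated
according to

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k \tag{11}$$

$$\alpha_k = -\frac{\mathbf{g}_k^T \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{Q} \mathbf{d}_k} \tag{12}$$

has the property that $\mathbf{x}_k$ minimizes $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{b}^T\mathbf{x}$ on the line
$\mathbf{x} = \mathbf{x}_{k-1} + \alpha \mathbf{d}_{k-1}$, $-\infty < \alpha < \infty$, as well as on the linear variety $\mathbf{x}_0 + \mathcal{B}_k$.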


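Proof. It need only be shown that $\mathbf{x}_k$ minimizes $f$ on the linear variety $\mathbf{x}_0 + \mathcal{B}_k$,
since it contains the line $\mathbf{x} = \mathbf{x}_{k-1} + \alpha \mathbf{d}_{k-1}$. Since $f$ is a strictly convex function,
the conclusion will hold if it can be shown that $\mathbf{g}_k$ is orthogonal to $\mathcal{B}_k$ (that is, the
gradient of $f$ at $\mathbf{x}_k$ is orthogonal to the subspace $\mathcal{B}_k$). The situation is illustrated in
Fig. 9.1. (Compare Theorem 2, Section 7.5.)

Fig. 9.1 Conjugate direction method

We prove $\mathbf{g}_k \perp \mathcal{B}_k$ by induction. Since $\mathcal{B}_0$ is empty the hypothesis is true
for $k = 0$. Assuming that it is true for $k$, that is, assuming $\mathbf{g}_k \perp \mathcal{B}_k$, we show that
$\mathbf{g}_{k+1} \perp \mathcal{B}_{k+1}$. We have

$$\mathbf{g}_{k+1} = \mathbf{g}_k + \alpha_k \mathbf{Q} \mathbf{d}_k, \tag{13}$$

and hence

$$\mathbf{d}_k^T \mathbf{g}_{k+1} = \mathbf{d}_k^T \mathbf{g}_k + \alpha_k \mathbf{d}_k^T \mathbf{Q} \mathbf{d}_k = 0 \tag{14}$$

by definition of $\alpha_k$. Also for $i < k$

$$\mathbf{d}_i^T \mathbf{g}_{k+1} = \mathbf{d}_i^T \mathbf{g}_k + \alpha_k \mathbf{d}_i^T \mathbf{Q} \mathbf{d}_k. \tag{15}$$

The first term on the right-hand side of (15) vanishes because of the induction
hypothesis, while the second vanishes by the Q-orthogonality of the $\mathbf{d}_i$'s. Thus
$\mathbf{g}_{k+1} \perp \mathcal{B}_{k+1}$. ∎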


Corollary. In the method of conjugate directions the gradients $\mathbf{g}_k$, $k = 0, 1, \ldots, n$
satisfy

$$\mathbf{g}_k^T \mathbf{d}_i = 0 \quad \text{for } i < k.$$

The above theorem is referred to as the Expanding Subspace Theorem,
since the $\mathcal{B}_k$'s form a sequence of subspaces with $\mathcal{B}_{k+1} \supset \mathcal{B}_k$. Since $\mathbf{x}_n$
minimizes $f$ over $\mathbf{x}_0 + \mathcal{B}_n$, it is clear that $\mathbf{x}_n$ must be the overall minimum
of $f$.

Fig. 9.2 Interpretation of expanding subspace theorem


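To obtain another interpretation of this result we again introduce the function

$$E(\mathbf{x}) = \tfrac{1}{2} (\mathbf{x} - \mathbf{x}^*)^T \mathbf{Q} (\mathbf{x} - \mathbf{x}^*) \tag{16}$$

as a measure of how close the vector $\mathbf{x}$ is to the solution $\mathbf{x}^*$. Since $E(\mathbf{x}) = f(\mathbf{x}) +
\tfrac{1}{2}\mathbf{x}^{*T}\mathbf{Q}\mathbf{x}^*$ the function $E$ can be regarded as the objective that we seek to
minimize.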
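
The relation between $E$ and $f$ quoted above follows in one line from $\mathbf{Q}\mathbf{x}^* = \mathbf{b}$, equation (2). The following expansion is a short worked check of that step, using only symbols already defined in the text:

```latex
\begin{aligned}
E(\mathbf{x}) &= \tfrac{1}{2}(\mathbf{x}-\mathbf{x}^*)^T \mathbf{Q} (\mathbf{x}-\mathbf{x}^*) \\
  &= \tfrac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{x}^T\mathbf{Q}\mathbf{x}^* + \tfrac{1}{2}\mathbf{x}^{*T}\mathbf{Q}\mathbf{x}^* \\
  &= \tfrac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{b}^T\mathbf{x} + \tfrac{1}{2}\mathbf{x}^{*T}\mathbf{Q}\mathbf{x}^*
   = f(\mathbf{x}) + \tfrac{1}{2}\mathbf{x}^{*T}\mathbf{Q}\mathbf{x}^* .
\end{aligned}
```

Since the last term does not depend on $\mathbf{x}$, minimizing $E$ and minimizing $f$ over any set give the same point.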

By considering the minimization of E we can regard the original problem as 
one of minimizing a generalized distance from the point x*. Indeed, if we had 
Q =I, the generalized notion of distance would correspond (within a factor of two) 
to the usual Euclidean distance. For an arbitrary positive-definite $\mathbf{Q}$ we say $E$ is a
generalized Euclidean metric or distance function. Vectors $\mathbf{d}_i$, $i = 0, 1, \ldots, n-1$
that are Q-orthogonal may be regarded as orthogonal in this generalized Euclidean
space and this leads to the simple interpretation of the Expanding Subspace Theorem
illustrated in Fig. 9.2. For simplicity we assume $\mathbf{x}_0 = \mathbf{0}$. In the figure $\mathbf{d}_k$ is shown
as being orthogonal to $\mathcal{B}_k$ with respect to the generalized metric. The point $\mathbf{x}_k$
minimizes $E$ over $\mathcal{B}_k$ while $\mathbf{x}_{k+1}$ minimizes $E$ over $\mathcal{B}_{k+1}$. The basic property is that,
since $\mathbf{d}_k$ is orthogonal to $\mathcal{B}_k$, the point $\mathbf{x}_{k+1}$ can be found by minimizing $E$ along
$\mathbf{d}_k$ and adding the result to $\mathbf{x}_k$.


9.3 THE CONJUGATE GRADIENT METHOD 


The conjugate gradient method is the conjugate direction method that is obtained by 
selecting the successive direction vectors as a conjugate version of the successive 
gradients obtained as the method progresses. Thus, the directions are not specified 
beforehand, but rather are determined sequentially at each step of the iteration. At 
step $k$ one evaluates the current negative gradient vector and adds to it a linear
combination of the previous direction vectors to obtain a new conjugate direction
vector along which to move.

There are three primary advantages to this method of direction selection. First,
unless the solution is attained in fewer than $n$ steps, the gradient is always nonzero
and linearly independent of all previous direction vectors. Indeed, the gradient $\mathbf{g}_k$
is orthogonal to the subspace $\mathcal{B}_k$ generated by $\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_{k-1}$. If the solution is
reached before $n$ steps are taken, the gradient vanishes and the process terminates,
it being unnecessary, in this case, to find additional directions.

Second, a more important advantage of the conjugate gradient method is the 
especially simple formula that is used to determine the new direction vector. This 
simplicity makes the method only slightly more complicated than steepest descent. 

Third, because the directions are based on the gradients, the process makes good 
uniform progress toward the solution at every step. This is in contrast to the situation 
for arbitrary sequences of conjugate directions in which progress may be slight until 
the final few steps. Although for the pure quadratic problem uniform progress is of no 
great importance, it is important for generalizations to nonquadratic problems. 


Conjugate Gradient Algorithm 


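Starting at any $\mathbf{x}_0 \in E^n$ define $\mathbf{d}_0 = -\mathbf{g}_0 = \mathbf{b} - \mathbf{Q}\mathbf{x}_0$ and

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k \tag{17}$$

$$\alpha_k = -\frac{\mathbf{g}_k^T \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{Q} \mathbf{d}_k} \tag{18}$$

$$\mathbf{d}_{k+1} = -\mathbf{g}_{k+1} + \beta_k \mathbf{d}_k \tag{19}$$

$$\beta_k = \frac{\mathbf{g}_{k+1}^T \mathbf{Q} \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{Q} \mathbf{d}_k}, \tag{20}$$

where $\mathbf{g}_k = \mathbf{Q}\mathbf{x}_k - \mathbf{b}$.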

In the algorithm the first step is identical to a steepest descent step; each 
succeeding step moves in a direction that is a linear combination of the current 
gradient and the preceding direction vector. The attractive feature of the algorithm 
is the simple formulae, (19) and (20), for updating the direction vector. The method 
is only slightly more complicated to implement than the method of steepest descent 
but converges in a finite number of steps. 
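
Formulas (17)-(20) translate almost line for line into code. The following Python sketch is an illustration only; the function name, the stopping tolerance `tol`, and the small test problem are assumptions introduced here, not part of the text.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def conjugate_gradient(Q, b, x0, tol=1e-12):
    """Conjugate gradient iteration (17)-(20) for 0.5 x'Qx - b'x, Q positive definite."""
    x = list(x0)
    g = [qx - bi for qx, bi in zip(matvec(Q, x), b)]      # g_0 = Q x_0 - b
    d = [-gi for gi in g]                                  # d_0 = -g_0
    iterates = [list(x)]
    for _ in range(len(b)):                                # at most n steps
        if dot(g, g) ** 0.5 <= tol:                        # gradient vanished: stop
            break
        Qd = matvec(Q, d)
        dQd = dot(d, Qd)
        alpha = -dot(g, d) / dQd                           # (18)
        x = [xi + alpha * di for xi, di in zip(x, d)]      # (17)
        g = [qx - bi for qx, bi in zip(matvec(Q, x), b)]   # g_{k+1}
        beta = dot(g, Qd) / dQd                            # (20)
        d = [-gi + beta * di for gi, di in zip(g, d)]      # (19)
        iterates.append(list(x))
    return x, iterates

# Hypothetical test problem.
Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, iterates = conjugate_gradient(Q, b, [0.0, 0.0, 0.0])
residual = [qx - bi for qx, bi in zip(matvec(Q, x), b)]
print(len(iterates) - 1, max(abs(r) for r in residual))   # steps taken, final residual
```

Only the vectors `x`, `g`, `d` and one product `Qd` are kept between steps, which is the sense in which the method is only slightly more complicated than steepest descent.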


Verification of the Algorithm 


To verify that the algorithm is a conjugate direction algorithm, it is necessary 
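to verify that the vectors $\{\mathbf{d}_k\}$ are Q-orthogonal. It is easiest to prove this by
simultaneously proving a number of other properties of the algorithm. This is done
in the theorem below where the notation $[\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_k]$ is used to denote the
subspace spanned by the vectors $\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_k$.

Conjugate Gradient Theorem. The conjugate gradient algorithm (17)-(20) is
a conjugate direction method. If it does not terminate at $\mathbf{x}_k$, then

a) $[\mathbf{g}_0, \mathbf{g}_1, \ldots, \mathbf{g}_k] = [\mathbf{g}_0, \mathbf{Q}\mathbf{g}_0, \ldots, \mathbf{Q}^k \mathbf{g}_0]$

b) $[\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_k] = [\mathbf{g}_0, \mathbf{Q}\mathbf{g}_0, \ldots, \mathbf{Q}^k \mathbf{g}_0]$

c) $\mathbf{d}_k^T \mathbf{Q} \mathbf{d}_i = 0$ for $i \leq k-1$

d) $\alpha_k = \mathbf{g}_k^T \mathbf{g}_k / \mathbf{d}_k^T \mathbf{Q} \mathbf{d}_k$

e) $\beta_k = \mathbf{g}_{k+1}^T \mathbf{g}_{k+1} / \mathbf{g}_k^T \mathbf{g}_k$.
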
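Proof. We first prove (a), (b) and (c) simultaneously by induction. Clearly, they
are true for $k = 0$. Now suppose they are true for $k$; we show that they are true for
$k+1$. We have

$$\mathbf{g}_{k+1} = \mathbf{g}_k + \alpha_k \mathbf{Q} \mathbf{d}_k.$$

By the induction hypothesis both $\mathbf{g}_k$ and $\mathbf{Q}\mathbf{d}_k$ belong to $[\mathbf{g}_0, \mathbf{Q}\mathbf{g}_0, \ldots, \mathbf{Q}^{k+1}\mathbf{g}_0]$, the
first by (a) and the second by (b). Thus $\mathbf{g}_{k+1} \in [\mathbf{g}_0, \mathbf{Q}\mathbf{g}_0, \ldots, \mathbf{Q}^{k+1}\mathbf{g}_0]$. Furthermore
$\mathbf{g}_{k+1} \notin [\mathbf{g}_0, \mathbf{Q}\mathbf{g}_0, \ldots, \mathbf{Q}^k\mathbf{g}_0] = [\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_k]$ since otherwise $\mathbf{g}_{k+1} = \mathbf{0}$, because for
any conjugate direction method $\mathbf{g}_{k+1}$ is orthogonal to $[\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_k]$. (The induction
hypothesis on (c) guarantees that the method is a conjugate direction method up to
$\mathbf{x}_{k+1}$.) Thus, finally we conclude that

$$[\mathbf{g}_0, \mathbf{g}_1, \ldots, \mathbf{g}_{k+1}] = [\mathbf{g}_0, \mathbf{Q}\mathbf{g}_0, \ldots, \mathbf{Q}^{k+1}\mathbf{g}_0],$$

which proves (a).

To prove (b) we write

$$\mathbf{d}_{k+1} = -\mathbf{g}_{k+1} + \beta_k \mathbf{d}_k,$$

and (b) immediately follows from (a) and the induction hypothesis on (b).

Next, to prove (c) we have

$$\mathbf{d}_{k+1}^T \mathbf{Q} \mathbf{d}_i = -\mathbf{g}_{k+1}^T \mathbf{Q} \mathbf{d}_i + \beta_k \mathbf{d}_k^T \mathbf{Q} \mathbf{d}_i.$$

For $i = k$ the right side is zero by definition of $\beta_k$. For $i < k$ both terms vanish.
The first term vanishes because $\mathbf{Q}\mathbf{d}_i \in [\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_{i+1}]$ by (b), while $\mathbf{g}_{k+1}$ is
orthogonal to $[\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_{i+1}]$: the induction hypothesis guarantees that the method
is a conjugate direction method up to $\mathbf{x}_{k+1}$, so the Expanding Subspace Theorem
applies. The second term vanishes by the induction hypothesis on (c).
This proves (c), which also proves that the method is a conjugate direction method.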
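To prove (d) we have

$$-\mathbf{g}_k^T \mathbf{d}_k = \mathbf{g}_k^T \mathbf{g}_k - \beta_{k-1} \mathbf{g}_k^T \mathbf{d}_{k-1},$$

and the second term is zero by the Expanding Subspace Theorem.

Finally, to prove (e) we note that $\mathbf{g}_{k+1}^T \mathbf{g}_k = 0$, because $\mathbf{g}_k \in [\mathbf{d}_0, \ldots, \mathbf{d}_k]$ and
$\mathbf{g}_{k+1}$ is orthogonal to $[\mathbf{d}_0, \ldots, \mathbf{d}_k]$. Thus since

$$\mathbf{Q}\mathbf{d}_k = \frac{1}{\alpha_k} (\mathbf{g}_{k+1} - \mathbf{g}_k),$$

we have

$$\mathbf{g}_{k+1}^T \mathbf{Q} \mathbf{d}_k = \frac{1}{\alpha_k} \mathbf{g}_{k+1}^T \mathbf{g}_{k+1}.$$

Dividing by $\mathbf{d}_k^T \mathbf{Q} \mathbf{d}_k$ and substituting (d) for $\alpha_k$ gives (e). ∎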


Parts (a) and (b) of this theorem are a formal statement of the interrelation 
between the direction vectors and the gradient vectors. Part (c) is the equation 
that verifies that the method is a conjugate direction method. Parts (d) and (e) are
identities yielding alternative formulae for $\alpha_k$ and $\beta_k$ that are often more convenient
than the original ones.
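
Parts (d) and (e) can be checked numerically by computing $\alpha_k$ and $\beta_k$ both ways during one run. The sketch below is only an illustration on a hypothetical 3 x 3 problem; the variable names and the test data are assumptions, and agreement holds only up to rounding error.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]   # hypothetical SPD matrix
b = [1.0, 2.0, 3.0]

x = [0.0, 0.0, 0.0]
g = [qx - bi for qx, bi in zip(matvec(Q, x), b)]
d = [-gi for gi in g]
records = []
for k in range(len(b)):
    Qd = matvec(Q, d)
    dQd = dot(d, Qd)
    alpha_def = -dot(g, d) / dQd                 # original formula (18)
    alpha_alt = dot(g, g) / dQd                  # part (d)
    x = [xi + alpha_def * di for xi, di in zip(x, d)]
    g_next = [gi + alpha_def * qdi for gi, qdi in zip(g, Qd)]   # relation (13)
    beta_def = dot(g_next, Qd) / dQd             # original formula (20)
    beta_alt = dot(g_next, g_next) / dot(g, g)   # part (e)
    records.append((alpha_def, alpha_alt, beta_def, beta_alt))
    d = [-gn + beta_def * di for gn, di in zip(g_next, d)]
    g = g_next

for k, (a1, a2, b1, b2) in enumerate(records):
    print(k, abs(a1 - a2), abs(b1 - b2))         # both differences are at rounding level
```

The alternative formulas need only the inner products $\mathbf{g}_k^T\mathbf{g}_k$ and $\mathbf{d}_k^T\mathbf{Q}\mathbf{d}_k$, which is why they are usually preferred in implementations.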


9.4 THE C-G METHOD AS AN OPTIMAL PROCESS


We turn now to the description of a special viewpoint that leads quickly to some 
very profound convergence results for the method of conjugate gradients. The basis 
of the viewpoint is part (b) of the Conjugate Gradient Theorem. This result tells
us the spaces $\mathcal{B}_k$ over which we successively minimize are determined by the
original gradient $\mathbf{g}_0$ and multiplications of it by $\mathbf{Q}$. Each step of the method brings
into consideration an additional power of $\mathbf{Q}$ times $\mathbf{g}_0$. It is this observation we
exploit.

Let us consider a new general approach for solving the quadratic minimization 
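problem. Given an arbitrary starting point $\mathbf{x}_0$, let

$$\mathbf{x}_{k+1} = \mathbf{x}_0 + P_k(\mathbf{Q})\mathbf{g}_0, \tag{21}$$

where $P_k$ is a polynomial of degree $k$. Selection of a set of coefficients for each of
the polynomials $P_k$ determines a sequence of $\mathbf{x}_k$'s. We have

$$\mathbf{x}_{k+1} - \mathbf{x}^* = \mathbf{x}_0 - \mathbf{x}^* + P_k(\mathbf{Q})\mathbf{Q}(\mathbf{x}_0 - \mathbf{x}^*) = [\mathbf{I} + \mathbf{Q}P_k(\mathbf{Q})](\mathbf{x}_0 - \mathbf{x}^*), \tag{22}$$

and hence

$$E(\mathbf{x}_{k+1}) = \tfrac{1}{2}(\mathbf{x}_{k+1} - \mathbf{x}^*)^T \mathbf{Q} (\mathbf{x}_{k+1} - \mathbf{x}^*) = \tfrac{1}{2}(\mathbf{x}_0 - \mathbf{x}^*)^T \mathbf{Q} [\mathbf{I} + \mathbf{Q}P_k(\mathbf{Q})]^2 (\mathbf{x}_0 - \mathbf{x}^*). \tag{23}$$

We may now pose the problem of selecting the polynomial $P_k$ in such a
way as to minimize $E(\mathbf{x}_{k+1})$ with respect to all possible polynomials of degree $k$.
Expanding (21), however, we obtain

$$\mathbf{x}_{k+1} = \mathbf{x}_0 + \gamma_0 \mathbf{g}_0 + \gamma_1 \mathbf{Q}\mathbf{g}_0 + \cdots + \gamma_k \mathbf{Q}^k \mathbf{g}_0, \tag{24}$$

where the $\gamma_i$'s are the coefficients of $P_k$. In view of

$$\mathcal{B}_{k+1} = [\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_k] = [\mathbf{g}_0, \mathbf{Q}\mathbf{g}_0, \ldots, \mathbf{Q}^k \mathbf{g}_0],$$

the vector $\mathbf{x}_{k+1} = \mathbf{x}_0 + \alpha_0 \mathbf{d}_0 + \alpha_1 \mathbf{d}_1 + \cdots + \alpha_k \mathbf{d}_k$ generated by the method of conjugate
gradients has precisely this form; moreover, according to the Expanding Subspace
Theorem, the coefficients $\gamma_i$ determined by the conjugate gradient process are such
as to minimize $E(\mathbf{x}_{k+1})$. Therefore, the problem posed of selecting the optimal $P_k$
is solved by the conjugate gradient procedure.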

The explicit relation between the optimal coefficients $\gamma_i$ of $P_k$ and the constants
$\alpha_i$, $\beta_i$ associated with the conjugate gradient method is, of course, somewhat
complicated, as is the relation between the coefficients of $P_k$ and those of $P_{k+1}$. The
power of the conjugate gradient method is that as it progresses it successively solves 
each of the optimal polynomial problems while updating only a small amount of 
information. 

We summarize the above development by the following very useful theorem. 


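Theorem 1. The point $\mathbf{x}_{k+1}$ generated by the conjugate gradient method
satisfies

$$E(\mathbf{x}_{k+1}) = \min_{P_k} \tfrac{1}{2}(\mathbf{x}_0 - \mathbf{x}^*)^T \mathbf{Q} [\mathbf{I} + \mathbf{Q}P_k(\mathbf{Q})]^2 (\mathbf{x}_0 - \mathbf{x}^*), \tag{25}$$

where the minimum is taken with respect to all polynomials $P_k$ of degree $k$.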


Bounds on Convergence 


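To use Theorem 1 most effectively it is convenient to recast it in terms of
eigenvectors and eigenvalues of the matrix $\mathbf{Q}$. Suppose that the vector $\mathbf{x}_0 - \mathbf{x}^*$ is written
in the eigenvector expansion

$$\mathbf{x}_0 - \mathbf{x}^* = \xi_1 \mathbf{e}_1 + \xi_2 \mathbf{e}_2 + \cdots + \xi_n \mathbf{e}_n,$$

where the $\mathbf{e}_i$'s are normalized eigenvectors of $\mathbf{Q}$. Then since $\mathbf{Q}(\mathbf{x}_0 - \mathbf{x}^*) = \lambda_1 \xi_1 \mathbf{e}_1 +
\lambda_2 \xi_2 \mathbf{e}_2 + \cdots + \lambda_n \xi_n \mathbf{e}_n$ and since the eigenvectors are mutually orthogonal, we have

$$E(\mathbf{x}_0) = \tfrac{1}{2}(\mathbf{x}_0 - \mathbf{x}^*)^T \mathbf{Q} (\mathbf{x}_0 - \mathbf{x}^*) = \tfrac{1}{2} \sum_{i=1}^{n} \lambda_i \xi_i^2, \tag{26}$$

where the $\lambda_i$'s are the corresponding eigenvalues of $\mathbf{Q}$. Applying the same
manipulations to (25), we find that for any polynomial $P_k$ of degree $k$ there holds

$$E(\mathbf{x}_{k+1}) \leq \tfrac{1}{2} \sum_{i=1}^{n} [1 + \lambda_i P_k(\lambda_i)]^2 \lambda_i \xi_i^2.$$

It then follows that

$$E(\mathbf{x}_{k+1}) \leq \max_{\lambda_i} [1 + \lambda_i P_k(\lambda_i)]^2 \; \tfrac{1}{2} \sum_{i=1}^{n} \lambda_i \xi_i^2,$$

and hence finally

$$E(\mathbf{x}_{k+1}) \leq \max_{\lambda_i} [1 + \lambda_i P_k(\lambda_i)]^2 E(\mathbf{x}_0).$$

We summarize this result by the following theorem.

Theorem 2. In the method of conjugate gradients we have

$$E(\mathbf{x}_{k+1}) \leq \max_{\lambda_i} [1 + \lambda_i P_k(\lambda_i)]^2 E(\mathbf{x}_0) \tag{27}$$

for any polynomial $P_k$ of degree $k$, where the maximum is taken over all
eigenvalues $\lambda_i$ of $\mathbf{Q}$.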


This way of viewing the conjugate gradient method as an optimal process is 
exploited in the next section. We note here that it implies the far from obvious fact 
that every step of the conjugate gradient method is at least as good as a steepest 
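descent step would be from the same point. To see this, suppose $\mathbf{x}_k$ has been
computed by the conjugate gradient method. From (24) we know $\mathbf{x}_k$ has the form

$$\mathbf{x}_k = \mathbf{x}_0 + \gamma_0 \mathbf{g}_0 + \gamma_1 \mathbf{Q}\mathbf{g}_0 + \cdots + \gamma_{k-1} \mathbf{Q}^{k-1} \mathbf{g}_0.$$

Now if $\mathbf{x}_{k+1}$ is computed from $\mathbf{x}_k$ by steepest descent, then $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{g}_k$ for
some $\alpha_k$. In view of part (a) of the Conjugate Gradient Theorem $\mathbf{x}_{k+1}$ will have
the form (24). Since for the conjugate direction method $E(\mathbf{x}_{k+1})$ is no greater than
its value at any other point $\mathbf{x}_{k+1}$ of the form (24), we obtain the desired conclusion.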

Typically when some information about the eigenvalue structure of Q is known, 
that information can be exploited by construction of a suitable polynomial $P_k$ to
use in (27). Suppose, for example, it were known that $\mathbf{Q}$ had only $m < n$ distinct
eigenvalues. Then it is clear that by suitable choice of $P_{m-1}$ it would be possible
to make the $m$th degree polynomial $1 + \lambda P_{m-1}(\lambda)$ have its $m$ zeros at the $m$
eigenvalues. Using that particular polynomial in (27) shows that $E(\mathbf{x}_m) = 0$. Thus
the optimal solution will be obtained in at most $m$, rather than $n$, steps. More
sophisticated examples of this type of reasoning are contained in the next section 
and in the exercises at the end of the chapter. 
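
The finite-termination claim for $m$ distinct eigenvalues is easy to observe numerically. The sketch below is an illustration under assumptions: the diagonal matrix with eigenvalues 1, 5 and 9 (so $n = 6$, $m = 3$), the right-hand side, and the tolerance are all hypothetical choices, and the step lengths use parts (d) and (e) of the Conjugate Gradient Theorem.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def cg_step_count(Q, b, x0, tol=1e-10):
    """Run conjugate gradients; return (x, number of steps until |g| <= tol)."""
    x = list(x0)
    g = [qx - bi for qx, bi in zip(matvec(Q, x), b)]
    d = [-gi for gi in g]
    steps = 0
    while dot(g, g) ** 0.5 > tol and steps < len(b):
        Qd = matvec(Q, d)
        alpha = dot(g, g) / dot(d, Qd)                          # part (d)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_next = [gi + alpha * qdi for gi, qdi in zip(g, Qd)]   # relation (13)
        beta = dot(g_next, g_next) / dot(g, g)                  # part (e)
        d = [-gn + beta * di for gn, di in zip(g_next, d)]
        g = g_next
        steps += 1
    return x, steps

# Hypothetical example: n = 6 but only m = 3 distinct eigenvalues.
eigenvalues = [1.0, 1.0, 5.0, 5.0, 5.0, 9.0]
n = len(eigenvalues)
Q = [[eigenvalues[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
b = [1.0] * n

x, steps = cg_step_count(Q, b, [0.0] * n)
print(steps)   # 3 steps suffice here, rather than n = 6
```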


9.5 THE PARTIAL CONJUGATE GRADIENT 
METHOD 


A collection of procedures that are natural to consider at this point are those in 
which the conjugate gradient procedure is carried out for $m + 1 < n$ steps and then,
rather than continuing, the process is restarted from the current point and $m + 1$
more conjugate gradient steps are taken. The special case of $m = 0$ corresponds
to the standard method of steepest descent, while $m = n - 1$ corresponds to the
full conjugate gradient method. These partial conjugate gradient methods are of 
extreme theoretical and practical importance, and their analysis yields additional 
insight into the method of conjugate gradients. The development of the last section 
forms the basis of our analysis. 

As before, given the problem 


$$\text{minimize} \quad \tfrac{1}{2}\mathbf{x}^T \mathbf{Q} \mathbf{x} - \mathbf{b}^T \mathbf{x}, \tag{28}$$

we define for any point $\mathbf{x}_k$ the gradient $\mathbf{g}_k = \mathbf{Q}\mathbf{x}_k - \mathbf{b}$. We consider an iteration
scheme of the form

$$\mathbf{x}_{k+1} = \mathbf{x}_k + P^k(\mathbf{Q})\mathbf{g}_k, \tag{29}$$

where $P^k$ is a polynomial of degree $m$. We select the coefficients of the polynomial
$P^k$ so as to minimize

$$E(\mathbf{x}_{k+1}) = \tfrac{1}{2}(\mathbf{x}_{k+1} - \mathbf{x}^*)^T \mathbf{Q} (\mathbf{x}_{k+1} - \mathbf{x}^*), \tag{30}$$

where $\mathbf{x}^*$ is the solution to (28). In view of the development of the last section, it
is clear that $\mathbf{x}_{k+1}$ can be found by taking $m + 1$ conjugate gradient steps rather than
explicitly determining the appropriate polynomial directly. (The sequence indexing
is slightly different here than in the previous section, since now we do not give
separate indices to the intermediate steps of this process. Going from $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$ by
the partial conjugate gradient method involves $m$ other points.)

The results of the previous section provide a tool for convergence analysis 
of this method. In this case, however, we develop a result that is of particular 
interest for Q’s having a special eigenvalue structure that occurs frequently in 
optimization problems, especially, as shown below and in Chapter 12, in the context 
of penalty function methods for solving problems with constraints. We imagine that 
the eigenvalues of Q are of two kinds: there are m large eigenvalues that may or 
may not be located near each other, and n — m smaller eigenvalues located within 
an interval [a, b]. Such a distribution of eigenvalues is shown in Fig. 9.3. 

Fig. 9.3 Eigenvalue distribution ($n - m$ eigenvalues in $[a, b]$; $m$ large eigenvalues to the right of $b$)

As an example, consider as in Section 8.7 the problem on $E^n$

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}\mathbf{x}^T \mathbf{Q} \mathbf{x} - \mathbf{b}^T \mathbf{x} \\ \text{subject to} \quad & \mathbf{c}^T \mathbf{x} = 0, \end{aligned}$$

where $\mathbf{Q}$ is a symmetric positive definite matrix with eigenvalues in the interval
$[a, A]$ and $\mathbf{b}$ and $\mathbf{c}$ are vectors in $E^n$. This is a constrained problem but it can be
approximated by the unconstrained problem

$$\text{minimize} \quad \tfrac{1}{2}\mathbf{x}^T \mathbf{Q} \mathbf{x} - \mathbf{b}^T \mathbf{x} + \tfrac{1}{2}\mu (\mathbf{c}^T \mathbf{x})^2,$$

where $\mu$ is a large positive constant. The last term in the objective function is
called a penalty term; for large $\mu$ minimization with respect to $\mathbf{x}$ will tend to make
$\mathbf{c}^T \mathbf{x}$ small.

The total quadratic term in the objective is $\tfrac{1}{2}\mathbf{x}^T(\mathbf{Q} + \mu \mathbf{c}\mathbf{c}^T)\mathbf{x}$, and thus it is
appropriate to consider the eigenvalues of the matrix $\mathbf{Q} + \mu \mathbf{c}\mathbf{c}^T$. As $\mu$ tends to
infinity it can be shown (see Chapter 13) that one eigenvalue of this matrix tends to
infinity and the other $n - 1$ eigenvalues remain bounded within the original interval
$[a, A]$.
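
The behavior of the eigenvalues can be seen directly in the simplest special case $\mathbf{Q} = \mathbf{I}$, taken here purely as an illustrative assumption (the general statement is the one cited above). For any nonzero $\mathbf{c}$:

```latex
(\mathbf{I} + \mu\,\mathbf{c}\mathbf{c}^T)\,\mathbf{c} = (1 + \mu\,\mathbf{c}^T\mathbf{c})\,\mathbf{c},
\qquad
(\mathbf{I} + \mu\,\mathbf{c}\mathbf{c}^T)\,\mathbf{v} = \mathbf{v}
\quad \text{whenever } \mathbf{c}^T\mathbf{v} = 0 .
```

So in this case one eigenvalue, $1 + \mu\,\mathbf{c}^T\mathbf{c}$, grows without bound as $\mu \to \infty$, while the remaining $n - 1$ eigenvalues stay equal to 1.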

As noted before, if steepest descent were applied to a problem with such a 
structure, convergence would be governed by the ratio of the smallest to largest 
eigenvalue, which in this case would be quite unfavorable. In the theorem below it is
stated that by successively repeating $m + 1$ conjugate gradient steps the effects of the
$m$ largest eigenvalues are eliminated and the rate of convergence is determined as
if they were not present. A computational example of this phenomenon is presented 
in Section 13.5. The reader may find it interesting to read that section right after 
this one. 


Theorem (Partial conjugate gradient method). Suppose the symmetric positive
definite matrix $\mathbf{Q}$ has $n - m$ eigenvalues in the interval $[a, b]$, $a > 0$ and
the remaining $m$ eigenvalues are greater than $b$. Then the method of partial
conjugate gradients, restarted every $m + 1$ steps, satisfies

$$E(\mathbf{x}_{k+1}) \leq \left( \frac{b - a}{b + a} \right)^2 E(\mathbf{x}_k). \tag{31}$$

(The point $\mathbf{x}_{k+1}$ is found from $\mathbf{x}_k$ by taking $m + 1$ conjugate gradient steps so
that each increment in $k$ is a composite of several simple steps.)


Proof. Application of (27) yields

$$E(\mathbf{x}_{k+1}) \leq \max_{\lambda_i} [1 + \lambda_i P(\lambda_i)]^2 E(\mathbf{x}_k) \tag{32}$$

for any $m$th-order polynomial $P$, where the $\lambda_i$'s are the eigenvalues of $\mathbf{Q}$. Let us
select $P$ so that the $(m+1)$th-degree polynomial $q(\lambda) = 1 + \lambda P(\lambda)$ vanishes at
$(a+b)/2$ and at the $m$ large eigenvalues of $\mathbf{Q}$. This is illustrated in Fig. 9.4. For
this choice of $P$ we may write (32) as

$$E(\mathbf{x}_{k+1}) \leq \max_{a \leq \lambda_i \leq b} [1 + \lambda_i P(\lambda_i)]^2 E(\mathbf{x}_k).$$

Fig. 9.4 Construction for proof

Since the polynomial $q(\lambda) = 1 + \lambda P(\lambda)$ has $m + 1$ real roots, $q'(\lambda)$ will have $m$ real
roots which alternate between the roots of $q(\lambda)$ on the real axis. Likewise, $q''(\lambda)$
will have $m - 1$ real roots which alternate between the roots of $q'(\lambda)$. Thus, since
$q'(\lambda)$ has no root in the interval $(-\infty, (a+b)/2)$, we see that $q''(\lambda)$ does not change
sign in that interval; and since it is easily verified that $q''(0) > 0$ it follows that
$q(\lambda)$ is convex for $\lambda < (a+b)/2$. Therefore, on $[0, (a+b)/2]$, $q(\lambda)$ lies below the
line $1 - [2\lambda/(a+b)]$. Thus we conclude that

$$q(\lambda) \leq 1 - \frac{2\lambda}{a+b}$$

on $[0, (a+b)/2]$ and that

$$q'\!\left( \frac{a+b}{2} \right) \geq -\frac{2}{a+b}.$$

We can see that on $[(a+b)/2, b]$

$$q(\lambda) \geq 1 - \frac{2\lambda}{a+b},$$

since for $q(\lambda)$ to cross first the line $1 - [2\lambda/(a+b)]$ and then the $\lambda$-axis would
require at least two changes in sign of $q''(\lambda)$, whereas, at most one root of $q''(\lambda)$
exists to the left of the second root of $q(\lambda)$. We see then that the inequality

$$|1 + \lambda P(\lambda)| \leq \left| 1 - \frac{2\lambda}{a+b} \right|$$

is valid on the interval $[a, b]$. Since the right side is at most $(b-a)/(b+a)$ on $[a, b]$,
the final result (31) follows immediately. ∎


In view of this theorem, the method of partial conjugate gradients can be 
regarded as a generalization of steepest descent, not only in its philosophy and 
implementation, but also in its behavior. Its rate of convergence is bounded by 
exactly the same formula as that of steepest descent but with the largest eigenvalues 
removed from consideration. (It is worth noting that for m =0 the above proof 
provides a simple derivation of the Steepest Descent Theorem.) 
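
The bound (31) can be observed on a small example. The sketch below is an illustration under assumptions: the diagonal matrix (three eigenvalues in $[a, b] = [1, 2]$ and $m = 2$ large eigenvalues), the right-hand side, and the function names are hypothetical, and each cycle consists of $m + 1 = 3$ conjugate gradient steps started with a pure gradient step.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def partial_cg_cycle(Q, b, x, m):
    """One cycle of the partial method: m + 1 conjugate gradient steps from x."""
    x = list(x)
    g = [qx - bi for qx, bi in zip(matvec(Q, x), b)]
    d = [-gi for gi in g]                       # restart with the negative gradient
    for _ in range(m + 1):
        gg = dot(g, g)
        if gg == 0.0:                           # already at the solution
            break
        Qd = matvec(Q, d)
        alpha = gg / dot(d, Qd)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_next = [gi + alpha * qdi for gi, qdi in zip(g, Qd)]
        beta = dot(g_next, g_next) / gg
        d = [-gn + beta * di for gn, di in zip(g_next, d)]
        g = g_next
    return x

# Hypothetical spectrum: n - m = 3 eigenvalues in [1, 2], m = 2 large ones.
eigenvalues = [1.0, 1.5, 2.0, 100.0, 1000.0]
n, m = len(eigenvalues), 2
Q = [[eigenvalues[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
b = [1.0] * n
x_star = [bi / lam for bi, lam in zip(b, eigenvalues)]

def E(x):
    e = [xi - si for xi, si in zip(x, x_star)]
    return 0.5 * dot(e, matvec(Q, e))

a_lo, b_hi = 1.0, 2.0
rate = ((b_hi - a_lo) / (b_hi + a_lo)) ** 2     # bound (31): 1/9 for this spectrum

x = [0.0] * n
history = [E(x)]
for _ in range(4):
    x = partial_cg_cycle(Q, b, x, m)
    history.append(E(x))

for before, after in zip(history, history[1:]):
    print(after, "<=", rate * before)
```

Although the condition number of this matrix is 1000, each cycle reduces $E$ at least as fast as steepest descent would on a problem with condition number 2.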




9.6 EXTENSION TO NONQUADRATIC PROBLEMS 


The general unconstrained minimization problem on $E^n$

$$\text{minimize} \quad f(\mathbf{x})$$


can be attacked by making suitable approximations to the conjugate gradient 
algorithm. There are a number of ways that this might be accomplished; the choice 
depends partially on what properties of f are easily computable. We look at three 
methods in this section and another in the following section. 


Quadratic Approximation 


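In the quadratic approximation method we make the following associations at $\mathbf{x}_k$:

$$\mathbf{g}_k \leftrightarrow \nabla f(\mathbf{x}_k)^T, \qquad \mathbf{Q} \leftrightarrow \mathbf{F}(\mathbf{x}_k),$$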


and using these associations, reevaluated at each step, all quantities necessary to 
implement the basic conjugate gradient algorithm can be evaluated. If f is quadratic, 
these associations are identities, so that the general algorithm obtained by using 
them is a generalization of the conjugate gradient scheme. This is similar to the 
philosophy underlying Newton’s method where at each step the solution of a general 
problem is approximated by the solution of a purely quadratic problem through 
these same associations. 

When applied to nonquadratic problems, conjugate gradient methods will not 
usually terminate within n steps. It is possible therefore simply to continue finding 
new directions according to the algorithm and terminate only when some termination 
criterion is met. Alternatively, the conjugate gradient process can be interrupted 
after n or n+ 1 steps and restarted with a pure gradient step. Since Q-conjugacy 
of the direction vectors in the pure conjugate gradient algorithm is dependent on 
the initial direction being the negative gradient, the restarting procedure seems to 
be preferred. We always include this restarting procedure. The general conjugate 
gradient algorithm is then defined as below. 


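Step 1. Starting at $\mathbf{x}_0$ compute $\mathbf{g}_0 = \nabla f(\mathbf{x}_0)^T$ and set $\mathbf{d}_0 = -\mathbf{g}_0$.

Step 2. For $k = 0, 1, \ldots, n-1$:

a) Set $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k$ where $\alpha_k = \dfrac{-\mathbf{g}_k^T \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{F}(\mathbf{x}_k) \mathbf{d}_k}$.

b) Compute $\mathbf{g}_{k+1} = \nabla f(\mathbf{x}_{k+1})^T$.

c) Unless $k = n-1$, set $\mathbf{d}_{k+1} = -\mathbf{g}_{k+1} + \beta_k \mathbf{d}_k$ where

$$\beta_k = \frac{\mathbf{g}_{k+1}^T \mathbf{F}(\mathbf{x}_k) \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{F}(\mathbf{x}_k) \mathbf{d}_k}$$

and repeat (a).

Step 3. Replace $\mathbf{x}_0$ by $\mathbf{x}_n$ and go back to Step 1.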


An attractive feature of the algorithm is that, just as in the pure form of 
Newton’s method, no line searching is required at any stage. Also, the algorithm 
converges in a finite number of steps for a quadratic problem. The undesirable 
features are that $\mathbf{F}(\mathbf{x}_k)$ must be evaluated at each point, which is often impractical,
and that the algorithm is not, in this form, globally convergent. 
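
Steps 1 to 3 above can be sketched as follows. This is an illustration under assumptions, not a robust implementation: the function name, the stopping tolerance, the cycle limit, the safeguard that restarts when the curvature $\mathbf{d}_k^T\mathbf{F}(\mathbf{x}_k)\mathbf{d}_k$ is not positive, and the convex test function with minimizer $(1, -0.5)$ are all hypothetical additions. No claim of convergence from an arbitrary starting point is intended.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def quadratic_approximation_cg(grad, hess, x0, cycles=10, tol=1e-10):
    """Restarted CG using the associations g_k <-> grad f(x_k), Q <-> F(x_k)."""
    x = list(x0)
    n = len(x)
    for _ in range(cycles):                      # each pass is Steps 1-3
        g = grad(x)                              # Step 1
        d = [-gi for gi in g]
        for k in range(n):                       # Step 2
            if dot(g, g) ** 0.5 <= tol:
                return x
            Fd = matvec(hess(x), d)              # F(x_k) d_k
            dFd = dot(d, Fd)
            if dFd <= 0.0:                       # added safeguard: restart
                break
            alpha = -dot(g, d) / dFd             # 2(a), no line search
            x = [xi + alpha * di for xi, di in zip(x, d)]
            g = grad(x)                          # 2(b)
            if k < n - 1:
                beta = dot(g, Fd) / dFd          # 2(c)
                d = [-gi + beta * di for gi, di in zip(g, d)]
    return x

# Hypothetical nonquadratic test function, with u = x[0] - 1 and v = x[1] + 0.5:
# f = u^4 + u^2 + u v + 2 v^2, which is convex with its minimum at (1, -0.5).
def grad_f(x):
    u, v = x[0] - 1.0, x[1] + 0.5
    return [4.0 * u ** 3 + 2.0 * u + v, u + 4.0 * v]

def hess_f(x):
    u = x[0] - 1.0
    return [[12.0 * u ** 2 + 2.0, 1.0], [1.0, 4.0]]

x_min = quadratic_approximation_cg(grad_f, hess_f, [1.5, 0.0])
print(x_min, dot(grad_f(x_min), grad_f(x_min)) ** 0.5)   # point and gradient norm
```

When `hess` returns a constant matrix the routine reduces to the conjugate gradient algorithm of Section 9.3 and terminates within one cycle of $n$ steps.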


Line Search Methods 


It is possible to avoid the direct use of the association $\mathbf{Q} \leftrightarrow \mathbf{F}(\mathbf{x}_k)$. First, instead
of using the formula for $\alpha_k$ in Step 2(a) above, $\alpha_k$ is found by a line search that
minimizes the objective. This agrees with the formula in the quadratic case. Second,
the formula for $\beta_k$ in Step 2(c) is replaced by a different formula, which is, however,
equivalent to the one in 2(c) in the quadratic case.

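The first such method proposed was the Fletcher–Reeves method, in which
Part (e) of the Conjugate Gradient Theorem is employed; that is,

$$\beta_k = \frac{\mathbf{g}_{k+1}^T \mathbf{g}_{k+1}}{\mathbf{g}_k^T \mathbf{g}_k}.$$

The complete algorithm (using restarts) is:

Step 1. Given $\mathbf{x}_0$ compute $\mathbf{g}_0 = \nabla f(\mathbf{x}_0)^T$ and set $\mathbf{d}_0 = -\mathbf{g}_0$.

Step 2. For $k = 0, 1, \ldots, n-1$:

a) Set $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k$ where $\alpha_k$ minimizes $f(\mathbf{x}_k + \alpha \mathbf{d}_k)$.

b) Compute $\mathbf{g}_{k+1} = \nabla f(\mathbf{x}_{k+1})^T$.

c) Unless $k = n-1$, set $\mathbf{d}_{k+1} = -\mathbf{g}_{k+1} + \beta_k \mathbf{d}_k$ where

$$\beta_k = \frac{\mathbf{g}_{k+1}^T \mathbf{g}_{k+1}}{\mathbf{g}_k^T \mathbf{g}_k}.$$

Step 3. Replace $\mathbf{x}_0$ by $\mathbf{x}_n$ and go back to Step 1.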


Another important method of this type is the Polak–Ribiere method, where

$$\beta_k = \frac{(\mathbf{g}_{k+1} - \mathbf{g}_k)^T \mathbf{g}_{k+1}}{\mathbf{g}_k^T \mathbf{g}_k}$$

is used to determine $\beta_k$. Again this leads to a value identical to the standard formula
in the quadratic case. Experimental evidence seems to favor the Polak–Ribiere
method over other methods of this general type.


Convergence


Global convergence of the line search methods is established by noting that a pure 
steepest descent step is taken every n steps and serves as a spacer step. Since 
the other steps do not increase the objective, and in fact hopefully they decrease 
it, global convergence is assured. Thus the restarting aspect of the algorithm is 
important for global convergence analysis, since in general one cannot guarantee 
that the directions $\mathbf{d}_k$ generated by the method are descent directions.
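
The restarted Fletcher–Reeves and Polak–Ribiere iterations can be sketched as follows. Everything beyond the two $\beta_k$ formulas and the restart every $n$ steps is an assumption made for illustration: the line search (interval doubling followed by golden-section search, adequate only when $f$ is unimodal along the search line), the tolerances, the cycle limit, and the convex test function with minimizer $(1, -0.5)$ are hypothetical choices.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def golden_section(phi, hi, iters=60):
    """Approximate minimizer of phi on [0, hi], assuming phi is unimodal there."""
    r = (5 ** 0.5 - 1) / 2
    lo = 0.0
    t1, t2 = hi - r * (hi - lo), lo + r * (hi - lo)
    f1, f2 = phi(t1), phi(t2)
    for _ in range(iters):
        if f1 < f2:
            hi, t2, f2 = t2, t1, f1
            t1 = hi - r * (hi - lo)
            f1 = phi(t1)
        else:
            lo, t1, f1 = t1, t2, f2
            t2 = lo + r * (hi - lo)
            f2 = phi(t2)
    return 0.5 * (lo + hi)

def line_minimize(f, x, d):
    """Step length a >= 0 approximately minimizing f(x + a d)."""
    phi = lambda a: f([xi + a * di for xi, di in zip(x, d)])
    hi = 1.0
    while phi(2 * hi) < phi(hi) and hi < 1e6:     # expand until the minimum is bracketed
        hi *= 2
    return golden_section(phi, 2 * hi)

def nonlinear_cg(f, grad, x0, rule="PR", cycles=50, tol=1e-8):
    x = list(x0)
    n = len(x)
    for _ in range(cycles):
        g = grad(x)                               # Step 1: pure gradient restart
        d = [-gi for gi in g]
        for k in range(n):                        # Step 2
            if dot(g, g) ** 0.5 <= tol:
                return x
            a = line_minimize(f, x, d)            # 2(a)
            x = [xi + a * di for xi, di in zip(x, d)]
            g_new = grad(x)                       # 2(b)
            if k < n - 1:                         # 2(c)
                if rule == "FR":                  # Fletcher-Reeves
                    beta = dot(g_new, g_new) / dot(g, g)
                else:                             # Polak-Ribiere
                    diff = [gn - go for gn, go in zip(g_new, g)]
                    beta = dot(diff, g_new) / dot(g, g)
                d = [-gn + beta * di for gn, di in zip(g_new, d)]
            g = g_new
    return x                                      # Step 3 is the next pass of the outer loop

# Hypothetical convex test function with minimum at (1, -0.5).
def f(x):
    u, v = x[0] - 1.0, x[1] + 0.5
    return u ** 4 + u ** 2 + u * v + 2.0 * v ** 2

def grad_f(x):
    u, v = x[0] - 1.0, x[1] + 0.5
    return [4.0 * u ** 3 + 2.0 * u + v, u + 4.0 * v]

for rule in ("FR", "PR"):
    print(rule, nonlinear_cg(f, grad_f, [-1.0, 2.0], rule=rule))
```

In practice the golden-section search would usually be replaced by a cheaper inexact line search; it is used here only because it is short and easy to check.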

The local convergence properties of both of the above, and most other, 
nonquadratic extensions of the conjugate gradient method can be inferred from the 
quadratic analysis. Assuming that at the solution, x*, the matrix F(x*) is positive 
definite, we expect the asymptotic convergence rate per step to be at least as good 
as steepest descent, since this is true in the quadratic case. In addition to this bound 
on the single step rate we expect that the method is of order two with respect to
each complete cycle of $n$ steps. In other words, since one complete cycle solves
a quadratic problem exactly just as Newton's method does in one step, we expect
that for general nonquadratic problems there will hold $|\mathbf{x}_{k+n} - \mathbf{x}^*| \leq c|\mathbf{x}_k - \mathbf{x}^*|^2$ for
some $c$ and $k = 0, n, 2n, 3n, \ldots$. This can indeed be proved, and of course underlies
the original motivation for the method. For problems with large n, however, a 
result of this type is in itself of little comfort, since we probably hope to terminate 
in fewer than n steps. Further discussion on this general topic is contained in 
Section 10.4. 


Scaling and Partial Methods 


Convergence of the partial conjugate gradient method, restarted every m+ 1 steps, 
will in general be linear. The rate will be determined by the eigenvalue structure 
of the Hessian matrix F(x*), and it may be possible to obtain fast convergence 
by changing the eigenvalue structure through scaling procedures. If, for example, 
the eigenvalues can be arranged to occur in m+ 1 bunches, the rate of the partial 
method will be relatively fast. Other structures can be analyzed by use of Theorem 2, 
Section 9.4, by using F(x*) rather than Q. 


9.7 PARALLEL TANGENTS 


In early experiments with the method of steepest descent the path of descent was 
noticed to be highly zig-zag in character, making slow indirect progress toward the 
solution. (This phenomenon is now quite well understood and is predicted by the 
convergence analysis of Section 8.6.) It was also noticed that in two dimensions 
the solution point often lies close to the line that connects the zig-zag points, as 
illustrated in Fig. 9.5. This observation motivated the accelerated gradient method 
in which a complete cycle consists of taking two steepest descent steps and then 
searching along the line connecting the initial point and the point obtained after 
the two gradient steps. The method of parallel tangents (PARTAN) was developed 
through an attempt to extend this idea to an acceleration scheme involving all
previous steps. The original development was based largely on a special geometric
property of the tangents to the contours of a quadratic function, but the method is
now recognized as a particular implementation of the method of conjugate gradients,
and this is the context in which it is treated here.

Fig. 9.5 Path of gradient method

The algorithm is defined by reference to Fig. 9.6. Starting at an arbitrary point
$\mathbf{x}_0$ the point $\mathbf{x}_1$ is found by a standard steepest descent step. After that, from a point
$\mathbf{x}_k$ the corresponding $\mathbf{y}_k$ is first found by a standard steepest descent step from $\mathbf{x}_k$,
and then $\mathbf{x}_{k+1}$ is taken to be the minimum point on the line connecting $\mathbf{x}_{k-1}$ and
$\mathbf{y}_k$. The process is continued for $n$ steps and then restarted with a standard steepest
descent step.

Notice that except for the first step, $\mathbf{x}_{k+1}$ is determined from $\mathbf{x}_k$, not by searching
along a single line, but by searching along two lines. The direction $\mathbf{d}_k$ connecting
two successive points (indicated as dotted lines in the figure) is thus determined
only indirectly. We shall see, however, that, in the case where the objective function
is quadratic, the $\mathbf{d}_k$'s are the same directions, and the $\mathbf{x}_k$'s are the same points, as
would be generated by the method of conjugate gradients.


PARTAN Theorem. For a quadratic function, PARTAN is equivalent to the 
method of conjugate gradients. 


Fig. 9.6 PARTAN

Fig. 9.7 One step of PARTAN


Proof. The proof is by induction. It is certainly true of the first step, since it
is a steepest descent step. Suppose that $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_k$ have been generated by the
conjugate gradient method and $\mathbf{x}_{k+1}$ is determined according to PARTAN. This
single step is shown in Fig. 9.7. We want to show that $\mathbf{x}_{k+1}$ is the same point as
would be generated by another step of the conjugate gradient method. For this to be
true $\mathbf{x}_{k+1}$ must be that point which minimizes $f$ over the plane defined by $\mathbf{d}_{k-1}$ and
$\mathbf{g}_k = \nabla f(\mathbf{x}_k)^T$. From the theory of conjugate gradients, this point will also minimize
$f$ over the subspace determined by $\mathbf{g}_k$ and all previous $\mathbf{d}_i$'s. Equivalently, we must
find the point $\mathbf{x}$ where $\nabla f(\mathbf{x})$ is orthogonal to both $\mathbf{g}_k$ and $\mathbf{d}_{k-1}$. Since $\mathbf{y}_k$ minimizes
$f$ along $\mathbf{g}_k$, we see that $\nabla f(\mathbf{y}_k)$ is orthogonal to $\mathbf{g}_k$. Since $\nabla f(\mathbf{x}_{k-1})$ is contained in
the subspace $[\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_{k-1}]$ and because $\mathbf{g}_k$ is orthogonal to this subspace by the
Expanding Subspace Theorem, we see that $\nabla f(\mathbf{x}_{k-1})$ is also orthogonal to $\mathbf{g}_k$. Since
$\nabla f(\mathbf{x})$ is linear in $\mathbf{x}$, it follows that at every point $\mathbf{x}$ on the line through $\mathbf{x}_{k-1}$ and
$\mathbf{y}_k$ we have $\nabla f(\mathbf{x})$ orthogonal to $\mathbf{g}_k$. By minimizing $f$ along this line, a point $\mathbf{x}_{k+1}$
is obtained where in addition $\nabla f(\mathbf{x}_{k+1})$ is orthogonal to the line. Thus $\nabla f(\mathbf{x}_{k+1})$ is
orthogonal to both $\mathbf{g}_k$ and the line joining $\mathbf{x}_{k-1}$ and $\mathbf{y}_k$. It follows that $\nabla f(\mathbf{x}_{k+1})$ is
orthogonal to the plane. ∎


There are advantages and disadvantages of PARTAN relative to other methods
when applied to nonquadratic problems. One attractive feature of the algorithm is
its simplicity and ease of implementation. Probably its most desirable property,
however, is its strong global convergence characteristics. Each step of the process
is at least as good as steepest descent, since going from $x_k$ to $y_k$ is exactly steepest
descent, and the additional move to $x_{k+1}$ provides further decrease of the objective
function. Thus global convergence is not tied to the fact that the process is restarted
every n steps. It is suggested, however, that PARTAN should be restarted every n
steps (or n + 1 steps) so that it will behave like the conjugate gradient method near
the solution.

An undesirable feature of the algorithm is that two line searches are required at 
each step, except the first, rather than one as is required by, say, the Fletcher-Reeves 
method. This is at least partially compensated by the fact that searches need not 
be as accurate for PARTAN, for while inaccurate searches in the Fletcher—Reeves 
method may yield nonsensical successive search directions, PARTAN will at least 
do as well as steepest descent. 
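
The equivalence stated in the PARTAN Theorem can be checked numerically. The following is a minimal sketch, assuming a small quadratic $f(x) = \tfrac{1}{2}x^TQx - b^Tx$ with an arbitrarily chosen positive definite Q; the helper names `line_min`, `partan`, and `conjugate_gradient` are illustrative and do not come from the text.

```python
import numpy as np

def line_min(Q, b, x, d):
    """Exact minimizer of f along the line x + t*d, for f(x) = 0.5 x^T Q x - b^T x."""
    g = Q @ x - b
    return x - (g @ d) / (d @ Q @ d) * d

def partan(Q, b, x0, steps):
    """Return the PARTAN iterates x_0, ..., x_steps (steps >= 1)."""
    xs = [x0, line_min(Q, b, x0, Q @ x0 - b)]           # x_1: plain steepest descent
    for _ in range(steps - 1):
        x_prev, x = xs[-2], xs[-1]
        y = line_min(Q, b, x, Q @ x - b)                # y_k: steepest descent from x_k
        xs.append(line_min(Q, b, x_prev, y - x_prev))   # x_{k+1}: minimum on line through x_{k-1}, y_k
    return xs

def conjugate_gradient(Q, b, x0, steps):
    """Return the conjugate gradient iterates x_0, ..., x_steps."""
    xs, g = [x0], Q @ x0 - b
    d = -g
    for _ in range(steps):
        Qd = Q @ d
        x = xs[-1] - (g @ d) / (d @ Qd) * d             # exact line search along d_k
        g = Q @ x - b
        d = -g + (g @ Qd) / (d @ Qd) * d                # next Q-conjugate direction
        xs.append(x)
    return xs

# Illustrative data (an assumption of this sketch, not taken from the text).
Q = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x0 = np.zeros(3)
partan_pts = partan(Q, b, x0, 3)
cg_pts = conjugate_gradient(Q, b, x0, 3)
max_gap = max(np.linalg.norm(p - c) for p, c in zip(partan_pts, cg_pts))
print("largest distance between PARTAN and CG iterates:", max_gap)
```

In this sketch each PARTAN step after the first uses two calls to `line_min`, which mirrors the two line searches per step discussed above.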




9.8 EXERCISES 


1. Let Q be a positive definite symmetric matrix and suppose $p_0, p_1, \ldots, p_{n-1}$ are linearly
independent vectors in $E^n$. Show that a Gram-Schmidt procedure can be used to generate
a sequence of Q-conjugate directions from the $p_i$'s. Specifically, show that $d_0, d_1, \ldots,
d_{n-1}$ defined recursively by

$$d_0 = p_0$$

$$d_{k+1} = p_{k+1} - \sum_{i=0}^{k} \frac{p_{k+1}^T Q d_i}{d_i^T Q d_i}\, d_i$$

form a Q-conjugate set.


2. Suppose the $p_i$'s in Exercise 1 are generated as moments of Q, that is, suppose
$p_k = Q^k p_0$, $k = 1, 2, \ldots, n-1$. Show that the corresponding $d_k$'s can then be generated
by a (three-term) recursion formula where $d_{k+1}$ is defined only in terms of $Qd_k$, $d_k$ and
$d_{k-1}$.


3. Suppose the $p_k$'s in Exercise 1 are taken as $p_k = e_k$ where $e_k$ is the kth unit coordinate
vector and the $d_k$'s are constructed accordingly. Show that using the $d_k$'s in a conjugate
direction method to minimize $\tfrac{1}{2}x^TQx - b^Tx$ is equivalent to the application of
Gaussian elimination to solve Qx = b.


4. Let $f(x) = \tfrac{1}{2}x^TQx - b^Tx$ be defined on $E^n$ with Q positive definite. Let $x_1$ be a
minimum point of f over a subspace of $E^n$ containing the vector d and let $x_2$ be the
minimum of f over another subspace containing d. Suppose $f(x_1) < f(x_2)$. Show that
$x_1 - x_2$ is Q-conjugate to d.


5. Let Q be a symmetric matrix. Show that any two eigenvectors of Q, corresponding to 
distinct eigenvalues, are Q-conjugate. 


6. Let Q be an $n \times n$ symmetric matrix and let $d_0, d_1, \ldots, d_{n-1}$ be Q-conjugate. Show
how to find an E such that $E^TQE$ is diagonal.


7. Show that in the conjugate gradient method $Qd_{k-1} \in \mathcal{B}_{k+1}$.


8. Derive the rate of convergence of the method of steepest descent by viewing it as a 
one-step optimal process. 


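9. Let $P^k(Q) = c_0 + c_1Q + c_2Q^2 + \cdots + c_mQ^m$ be the optimal polynomial in (29) minimizing
(30). Show that the $c_i$'s can be found explicitly by solving the vector equation

$$-\begin{bmatrix} g_k^TQg_k & g_k^TQ^2g_k & \cdots & g_k^TQ^{m+1}g_k \\ g_k^TQ^2g_k & g_k^TQ^3g_k & \cdots & g_k^TQ^{m+2}g_k \\ \vdots & & & \vdots \\ g_k^TQ^{m+1}g_k & \cdots & \cdots & g_k^TQ^{2m+1}g_k \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_m \end{bmatrix} = \begin{bmatrix} g_k^Tg_k \\ g_k^TQg_k \\ \vdots \\ g_k^TQ^mg_k \end{bmatrix}.$$

Show that this reduces to steepest descent when m = 0.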


10. Show that for the method of conjugate directions there holds

$$E(x_k) \le 4\left(\frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}\right)^{2k} E(x_0),$$

where $\gamma = a/A$ and a and A are the smallest and largest eigenvalues of Q. Hint: In (27)
select $P_{k-1}(\lambda)$ so that

$$1 + \lambda P_{k-1}(\lambda) = \frac{T_k\!\left(\dfrac{A + a - 2\lambda}{A - a}\right)}{T_k\!\left(\dfrac{A + a}{A - a}\right)},$$

where $T_k(\lambda) = \cos(k \arccos \lambda)$ is the kth Chebyshev polynomial. This choice gives
the minimum maximum magnitude on [a, A]. Verify and use the inequality

$$\frac{(1-\gamma)^k}{(1+\sqrt{\gamma})^{2k} + (1-\sqrt{\gamma})^{2k}} \le \left(\frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}\right)^{k}.$$


11. Suppose it is known that each eigenvalue of Q lies either in the interval [a, A] or in
the interval $[a + \Delta, A + \Delta]$ where a, A, and $\Delta$ are all positive. Show that the partial
conjugate gradient method restarted every two steps will converge with a ratio no
greater than $[(A - a)/(A + a)]^2$ no matter how large $\Delta$ is.


12. Modify the first method given in Section 9.6 so that it is globally convergent. 


13. Show that in the purely quadratic form of the conjugate gradient method $d_k^TQd_k =
-d_k^TQg_k$. Using this show that to obtain $x_{k+1}$ from $x_k$ it is necessary to use Q only to
evaluate $g_k$ and $Qg_k$.


14. Show that in the quadratic problem $Qg_k$ can be evaluated by taking a unit step from $x_k$
in the direction of the negative gradient and evaluating the gradient there. Specifically,
if $y_k = x_k - g_k$ and $p_k = \nabla f(y_k)^T$, then $Qg_k = g_k - p_k$.


15. Combine the results of Exercises 13 and 14 to derive a conjugate gradient method for
general problems much in the spirit of the first method of Section 9.6 but which does
not require knowledge of $F(x_k)$ or a line search.
Chapter 10 QUASI-NEWTON METHODS


In this chapter we take another approach toward the development of methods lying 
somewhere intermediate to steepest descent and Newton’s method. Again working 
under the assumption that evaluation and use of the Hessian matrix is impractical 
or costly, the idea underlying quasi-Newton methods is to use an approximation to 
the inverse Hessian in place of the true inverse that is required in Newton’s method. 
The form of the approximation varies among different methods—ranging from 
the simplest where it remains fixed throughout the iterative process, to the more 
advanced where improved approximations are built up on the basis of information 
gathered during the descent process. 

The quasi-Newton methods that build up an approximation to the inverse 
Hessian are analytically the most sophisticated methods discussed in this book for 
solving unconstrained problems and represent the culmination of the development 
of algorithms through detailed analysis of the quadratic problem. As might be 
expected, the convergence properties of these methods are somewhat more difficult 
to discover than those of simpler methods. Nevertheless, we are able, by continuing 
with the same basic techniques as before, to illuminate their most important features. 

In the course of our analysis we develop two important generalizations of 
the method of steepest descent and its corresponding convergence rate theorem. 
The first, discussed in Section 10.1, modifies steepest descent by taking as the 
direction vector a positive definite transformation of the negative gradient. The 
second, discussed in Section 10.8, is a combination of steepest descent and Newton’s 
method. Both of these fundamental methods have convergence properties analogous 
to those of steepest descent. 


10.1 MODIFIED NEWTON METHOD 


A very basic iterative process for solving the problem

minimize f(x)

which includes as special cases most of our earlier ones is

$$x_{k+1} = x_k - \alpha_k S_k \nabla f(x_k)^T, \tag{1}$$

where $S_k$ is a symmetric $n \times n$ matrix and where, as usual, $\alpha_k$ is chosen to minimize
$f(x_{k+1})$. If $S_k$ is the inverse of the Hessian of f, we obtain Newton’s method, while
if $S_k = I$ we have steepest descent. It would seem to be a good idea, in general,
to select $S_k$ as an approximation to the inverse of the Hessian. We examine that
philosophy in this section.

First, we note, as in Section 8.8, that in order that the process (1) be guaranteed
to be a descent method for small values of $\alpha$, it is necessary in general to require
that $S_k$ be positive definite. We shall therefore always impose this as a requirement.

Because of the similarity of the algorithm (1) with steepest descent* it should
not be surprising that its convergence properties are similar in character to our
earlier results. We derive the actual rate of convergence by considering, as usual,
the standard quadratic problem with

$$f(x) = \tfrac{1}{2}x^TQx - b^Tx, \tag{2}$$

where Q is symmetric and positive definite. For this case we can find an explicit
expression for $\alpha_k$ in (1). The algorithm becomes

$$x_{k+1} = x_k - \alpha_k S_k g_k, \tag{3a}$$

where

$$g_k = Qx_k - b, \tag{3b}$$

$$\alpha_k = \frac{g_k^T S_k g_k}{g_k^T S_k Q S_k g_k}. \tag{3c}$$

We may then derive the convergence rate of this algorithm by slightly extending 
the analysis carried out for the method of steepest descent. 


Modified Newton Method Theorem (Quadratic case). Let $x^*$ be the unique
minimum point of f, and define $E(x) = \tfrac{1}{2}(x - x^*)^TQ(x - x^*)$.
Then for the algorithm (3) there holds at every step k

$$E(x_{k+1}) \le \left(\frac{B_k - b_k}{B_k + b_k}\right)^2 E(x_k), \tag{4}$$

where $b_k$ and $B_k$ are, respectively, the smallest and largest eigenvalues of the
matrix $S_kQ$.


*The algorithm (1) is sometimes referred to as the method of deflected gradients, since the
direction vector can be thought of as being determined by deflecting the gradient through
multiplication by $S_k$.
Proof. We have by direct substitution

$$\frac{E(x_k) - E(x_{k+1})}{E(x_k)} = \frac{(g_k^T S_k g_k)^2}{(g_k^T S_k Q S_k g_k)(g_k^T Q^{-1} g_k)}.$$

Letting $T_k = S_k^{1/2} Q S_k^{1/2}$ and $p_k = S_k^{1/2} g_k$ we obtain

$$\frac{E(x_k) - E(x_{k+1})}{E(x_k)} = \frac{(p_k^T p_k)^2}{(p_k^T T_k p_k)(p_k^T T_k^{-1} p_k)}.$$

From the Kantorovich inequality we obtain easily

$$E(x_{k+1}) \le \left(\frac{B_k - b_k}{B_k + b_k}\right)^2 E(x_k),$$

where $b_k$ and $B_k$ are the smallest and largest eigenvalues of $T_k$. Since $S_k^{1/2} T_k S_k^{-1/2} =
S_kQ$, we see that $S_kQ$ is similar to $T_k$ and therefore has the same eigenvalues.
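
The bound (4) can be observed on a small instance. This is a sketch only, assuming a fixed diagonal matrix S used at every step (so that $S^{1/2}$ is an elementwise square root) and arbitrarily chosen Q and b; none of these numbers come from the text.

```python
import numpy as np

# Illustrative data (assumptions of this sketch).
Q = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
S = np.diag([0.25, 0.3, 0.5])           # fixed positive definite S_k
x_star = np.linalg.solve(Q, b)

def E(x):
    """Error function E(x) = 0.5 (x - x*)^T Q (x - x*)."""
    e = x - x_star
    return 0.5 * e @ Q @ e

# Eigenvalues of T = S^{1/2} Q S^{1/2}, which are also those of S Q.
S_half = np.sqrt(S)                      # valid because S is diagonal
eig = np.linalg.eigvalsh(S_half @ Q @ S_half)
bound = ((eig.max() - eig.min()) / (eig.max() + eig.min())) ** 2

x = np.zeros(3)
ratios = []
for _ in range(6):
    g = Q @ x - b                                   # (3b)
    alpha = (g @ S @ g) / (g @ S @ Q @ S @ g)       # (3c)
    x_next = x - alpha * (S @ g)                    # (3a)
    ratios.append(E(x_next) / E(x))
    x = x_next

print("bound:", bound, "largest observed ratio:", max(ratios))
```

Every observed ratio $E(x_{k+1})/E(x_k)$ should fall at or below the bound, and choosing S closer to $Q^{-1}$ shrinks the bound toward zero.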


This theorem supports the intuitive notion that for the quadratic problem one
should strive to make $S_k$ close to $Q^{-1}$ since then both $b_k$ and $B_k$ would be close
to unity and convergence would be rapid. For a nonquadratic objective function f
the analog to Q is the Hessian F(x), and hence one should try to make $S_k$ close to
$F(x_k)^{-1}$.

Two remarks may help to put the above result in proper perspective. The 
first remark is that both the algorithm (1) and the theorem stated above are only 
simple, minor, and natural extensions of the work presented in Chapter 8 on steepest 
descent. As such the result of this section can be regarded, correspondingly, not as 
a new idea but as an extension of the basic result on steepest descent. The second 
remark is that this one simple result when properly applied can quickly characterize 
the convergence properties of some fairly complex algorithms. Thus, rather than 
an isolated result concerned with a specific form of algorithm, the theorem above 
should be regarded as a general tool for convergence analysis. It provides significant 
insight into various quasi-Newton methods discussed in this chapter. 


A Classical Method 


We conclude this section by mentioning the classical modified Newton’s method, a
standard method for approximating Newton’s method without evaluating $F(x_k)^{-1}$
for each k. We set

$$x_{k+1} = x_k - \alpha_k [F(x_0)]^{-1} \nabla f(x_k)^T. \tag{5}$$

In this method the Hessian at the initial point $x_0$ is used throughout the process.
The effectiveness of this procedure is governed largely by how fast the Hessian is
changing, in other words, by the magnitude of the third derivatives of f.
10.2 CONSTRUCTION OF THE INVERSE


The fundamental idea behind most quasi-Newton methods is to try to construct
the inverse Hessian, or an approximation of it, using information gathered as the
descent process progresses. The current approximation $H_k$ is then used at each stage
to define the next descent direction by setting $S_k = H_k$ in the modified Newton
method. Ideally, the approximations converge to the inverse of the Hessian at the
solution point and the overall method behaves somewhat like Newton’s method.
In this section we show how the inverse Hessian can be built up from gradient
information obtained at various points.

Let f be a function on $E^n$ that has continuous second partial derivatives.
If for two points $x_{k+1}$, $x_k$ we define $g_{k+1} = \nabla f(x_{k+1})^T$, $g_k = \nabla f(x_k)^T$ and $p_k =
x_{k+1} - x_k$, then

$$g_{k+1} - g_k \approx F(x_k) p_k. \tag{6}$$

If the Hessian, F, is constant, then we have

$$q_k \equiv g_{k+1} - g_k = F p_k, \tag{7}$$

and we see that evaluation of the gradient at two points gives information about F.
If n linearly independent directions $p_0, p_1, p_2, \ldots, p_{n-1}$ and the corresponding $q_k$'s
are known, then F is uniquely determined. Indeed, letting P and Q be the $n \times n$
matrices with columns $p_k$ and $q_k$ respectively, we have

$$F = QP^{-1}. \tag{8}$$
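
As a small illustration of (8), the sketch below recovers a constant Hessian from gradient differences. The matrix F, the vector b, and the steps are arbitrary choices made for this example; `Qmat` stands for the matrix called Q in (8) and is renamed only to avoid clashing with other uses of Q.

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T F x - b^T x (data assumed for this sketch).
F = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

def grad(x):
    """Gradient of the quadratic, returned as a 1-D array."""
    return F @ x - b

# Three linearly independent steps p_0, p_1, p_2 taken in sequence from x_0 = 0.
steps = [np.array([1.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0]), np.array([1.0, 1.0, 1.0])]
x = np.zeros(3)
p_cols, q_cols = [], []
for p in steps:
    x_next = x + p
    p_cols.append(p)
    q_cols.append(grad(x_next) - grad(x))   # q_k = g_{k+1} - g_k
    x = x_next

P = np.column_stack(p_cols)
Qmat = np.column_stack(q_cols)
F_recovered = Qmat @ np.linalg.inv(P)        # equation (8)
print(F_recovered)
```

Only gradient evaluations enter the computation; the Hessian itself is never queried.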


It is natural to attempt to construct successive approximations $H_k$ to $F^{-1}$ based
on data obtained from the first k steps of a descent process in such a way that if
F were constant the approximation would be consistent with (7) for these steps.
Specifically, if F were constant $H_{k+1}$ would satisfy

$$H_{k+1} q_i = p_i, \quad 0 \le i \le k. \tag{9}$$

After n linearly independent steps we would then have $H_n = F^{-1}$.

For any k < n the problem of constructing a suitable $H_k$, which in general serves as
an approximation to the inverse Hessian and which in the case of constant F satisfies (9),
admits an infinity of solutions, since there are more degrees of freedom than there are
constraints. Thus a particular method can take into account additional considerations.
We discuss below one of the simplest schemes that has been proposed.


Rank One Correction 


Since F and $F^{-1}$ are symmetric, it is natural to require that $H_k$, the approximation
to $F^{-1}$, be symmetric. We investigate the possibility of defining a recursion of
the form

$$H_{k+1} = H_k + a_k z_k z_k^T, \tag{10}$$

which preserves symmetry. The vector $z_k$ and the constant $a_k$ define a matrix of
(at most) rank one, by which the approximation to the inverse is updated. We select
them so that (9) is satisfied. Setting i equal to k in (9) and substituting (10) we obtain

$$p_k = H_{k+1} q_k = H_k q_k + a_k z_k z_k^T q_k. \tag{11}$$

Taking the inner product with $q_k$ we have

$$q_k^T p_k - q_k^T H_k q_k = a_k (z_k^T q_k)^2. \tag{12}$$

On the other hand, using (11) we may write (10) as

$$H_{k+1} = H_k + \frac{(p_k - H_k q_k)(p_k - H_k q_k)^T}{a_k (z_k^T q_k)^2},$$

which in view of (12) leads finally to

$$H_{k+1} = H_k + \frac{(p_k - H_k q_k)(p_k - H_k q_k)^T}{q_k^T (p_k - H_k q_k)}. \tag{13}$$

We have determined what a rank one correction must be if it is to satisfy (9)
for i = k. It remains to be shown that, for the case where F is constant, (9) is also
satisfied for i < k. This in turn will imply that the rank one recursion converges to
$F^{-1}$ after at most n steps.


Theorem. Let F be a fixed symmetric matrix and suppose that $p_0, p_1,
p_2, \ldots, p_k$ are given vectors. Define the vectors $q_i = Fp_i$, $i = 0, 1, 2, \ldots, k$.
Starting with any initial symmetric matrix $H_0$ let

$$H_{i+1} = H_i + \frac{(p_i - H_i q_i)(p_i - H_i q_i)^T}{q_i^T (p_i - H_i q_i)}. \tag{14}$$

Then

$$p_i = H_{k+1} q_i \quad \text{for } i \le k. \tag{15}$$


Proof. The proof is by induction. Suppose it is true for $H_k$, and $i \le k - 1$. The
relation was shown above to be true for $H_{k+1}$ and i = k. For i < k

$$H_{k+1} q_i = H_k q_i + y_k (p_k^T q_i - q_k^T H_k q_i), \tag{16}$$

where

$$y_k = \frac{p_k - H_k q_k}{q_k^T (p_k - H_k q_k)}.$$

By the induction hypothesis, (16) becomes

$$H_{k+1} q_i = p_i + y_k (p_k^T q_i - q_k^T p_i).$$

From the calculation

$$q_k^T p_i = p_k^T F p_i = p_k^T q_i,$$

it follows that the second term vanishes. ∎


To incorporate the approximate inverse Hessian in a descent procedure while
simultaneously improving it, we calculate the direction $d_k$ from

$$d_k = -H_k g_k$$

and then minimize $f(x_k + \alpha d_k)$ with respect to $\alpha \ge 0$. This determines $x_{k+1} =
x_k + \alpha_k d_k$, $p_k = \alpha_k d_k$, and $g_{k+1}$. Then $H_{k+1}$ can be calculated according to (13).

There are some difficulties with this simple rank one procedure. First, the
updating formula (13) preserves positive definiteness only if $q_k^T(p_k - H_kq_k) > 0$,
which cannot be guaranteed (see Exercise 6). Also, even if $q_k^T(p_k - H_kq_k)$ is positive,
it may be small, which can lead to numerical difficulties. Thus, although an excellent
simple example of how information gathered during the descent process can in
principle be used to update an approximation to the inverse Hessian, the rank one
method possesses some limitations.
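
The rank one recursion (13) is short enough to try directly. The sketch below follows the setting of the theorem above (a fixed symmetric F and given vectors $p_i$, with no line search), using coordinate vectors as the $p_i$. The function name `rank_one_update`, the data, and the rule of skipping an update when the denominator is negligible are assumptions of this sketch, not part of the text.

```python
import numpy as np

def rank_one_update(H, p, q, tol=1e-12):
    """Rank one correction (13); returns H unchanged if the denominator is negligible."""
    r = p - H @ q
    denom = q @ r
    if abs(denom) <= tol * np.linalg.norm(q) * np.linalg.norm(r):
        return H                      # safeguard chosen for this sketch
    return H + np.outer(r, r) / denom

# Illustrative fixed symmetric matrix F (assumed data).
F = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
H = np.eye(3)                         # initial symmetric H_0
denominators = []
for i in range(3):
    p = np.eye(3)[i]                  # p_i: i-th coordinate vector
    q = F @ p                         # q_i = F p_i
    denominators.append(q @ (p - H @ q))
    H = rank_one_update(H, p, q)

print("denominators:", denominators)
print("H_3 F =")
print(H @ F)
```

For this particular data the denominators come out negative, which illustrates that the sign condition mentioned above is not automatic, while the final matrix still reproduces $F^{-1}$ as the theorem predicts.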


10.3. DAVIDON-FLETCHER-POWELL METHOD 


The earliest, and certainly one of the most clever schemes for constructing the inverse 
Hessian, was originally proposed by Davidon and later developed by Fletcher and 
Powell. It has the fascinating and desirable property that, for a quadratic objective, 
it simultaneously generates the directions of the conjugate gradient method while 
constructing the inverse Hessian. At each step the inverse Hessian is updated 
by the sum of two symmetric rank one matrices, and this scheme is therefore 
often referred to as a rank two correction procedure. The method is also often 
referred to as the variable metric method, the name originally suggested by Davidon. 

The procedure is this: Starting with any symmetric positive definite matrix $H_0$,
any point $x_0$, and with k = 0,


Step 1. Set $d_k = -H_k g_k$.


Step 2. Minimize $f(x_k + \alpha d_k)$ with respect to $\alpha \ge 0$ to obtain $x_{k+1}$, $p_k = \alpha_k d_k$,
and $g_{k+1}$.


Step 3. Set $q_k = g_{k+1} - g_k$ and

$$H_{k+1} = H_k + \frac{p_k p_k^T}{p_k^T q_k} - \frac{H_k q_k q_k^T H_k}{q_k^T H_k q_k}. \tag{17}$$

Update k and return to Step 1.
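
Steps 1 to 3 can be sketched as follows for the special case of a quadratic objective, where the line minimization in Step 2 has a closed form. The function names, the stopping tolerance, and the test data are assumptions made for this illustration; a general objective would need a numerical line search in place of the closed-form step.

```python
import numpy as np

def dfp_update(H, p, q):
    """Rank two DFP correction, equation (17)."""
    Hq = H @ q
    return H + np.outer(p, p) / (p @ q) - np.outer(Hq, Hq) / (q @ Hq)

def dfp_quadratic(F, b, x0, H0, max_steps, tol=1e-12):
    """DFP iterations for f(x) = 0.5 x^T F x - b^T x with exact line searches."""
    x, H = np.array(x0, dtype=float), np.array(H0, dtype=float)
    g = F @ x - b
    for _ in range(max_steps):
        if np.linalg.norm(g) <= tol:          # already at the minimum
            break
        d = -H @ g                            # Step 1
        alpha = -(g @ d) / (d @ F @ d)        # Step 2: exact minimizer along d
        p = alpha * d
        x = x + p
        g_next = F @ x - b
        q = g_next - g                        # Step 3
        H = dfp_update(H, p, q)
        g = g_next
    return x, H

# Illustrative data (assumed for this sketch).
F = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x_final, H_final = dfp_quadratic(F, b, np.zeros(3), np.eye(3), max_steps=3)
print("x after 3 steps:", x_final)
```

With n = 3 the sketch is expected to reach the minimizer in three steps, which is the finite-step behavior established later in this section.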
Positive Definiteness


We first demonstrate that if $H_k$ is positive definite, then so is $H_{k+1}$. For any $x \in E^n$
we have

$$x^T H_{k+1} x = x^T H_k x + \frac{(x^T p_k)^2}{p_k^T q_k} - \frac{(x^T H_k q_k)^2}{q_k^T H_k q_k}. \tag{18}$$

Defining $a = H_k^{1/2} x$, $b = H_k^{1/2} q_k$ we may rewrite (18) as

$$x^T H_{k+1} x = \frac{(a^T a)(b^T b) - (a^T b)^2}{b^T b} + \frac{(x^T p_k)^2}{p_k^T q_k}.$$

We also have

$$p_k^T q_k = p_k^T g_{k+1} - p_k^T g_k = -p_k^T g_k, \tag{19}$$

since

$$p_k^T g_{k+1} = 0, \tag{20}$$

because $x_{k+1}$ is the minimum point of f along $p_k$. Thus by definition of $p_k$

$$p_k^T q_k = \alpha_k g_k^T H_k g_k, \tag{21}$$

and hence

$$x^T H_{k+1} x = \frac{(a^T a)(b^T b) - (a^T b)^2}{b^T b} + \frac{(x^T p_k)^2}{\alpha_k g_k^T H_k g_k}. \tag{22}$$

Both terms on the right of (22) are nonnegative, the first by the Cauchy-Schwarz
inequality. We must only show they do not both vanish simultaneously. The first
term vanishes only if a and b are proportional. This in turn implies that x and $q_k$
are proportional, say $x = \beta q_k$. In that case, however,

$$p_k^T x = \beta p_k^T q_k = \beta \alpha_k g_k^T H_k g_k \ne 0$$

from (21). Thus $x^T H_{k+1} x > 0$ for all nonzero x.

It is of interest to note that in the proof above the fact that $\alpha_k$ is chosen as
the minimum point of the line search was used in (20), which led to the important
conclusion $p_k^T q_k > 0$. Actually any $\alpha_k$, whether the minimum point or not, that
gives $p_k^T q_k > 0$ can be used in the algorithm, and $H_{k+1}$ will be positive definite (see
Exercises 8 and 9).
Finite Step Convergence


We assume now that f is quadratic with (constant) Hessian F. We show in this
case that the Davidon-Fletcher-Powell method produces direction vectors $p_k$ that
are F-orthogonal and that if the method is carried n steps then $H_n = F^{-1}$.


Theorem. If f is quadratic with positive definite Hessian F, then for the
Davidon-Fletcher-Powell method

$$p_i^T F p_j = 0, \quad 0 \le i < j \le k \tag{23}$$

$$H_{k+1} F p_i = p_i \quad \text{for } 0 \le i \le k. \tag{24}$$


Proof. We note that for the quadratic case

$$q_k = g_{k+1} - g_k = Fx_{k+1} - Fx_k = Fp_k. \tag{25}$$

Also

$$H_{k+1} F p_k = H_{k+1} q_k = p_k \tag{26}$$

from (17).
We now prove (23) and (24) by induction. From (26) we see that they are true
for k = 0. Assuming they are true for k - 1, we prove they are true for k. We have

$$g_k = g_{i+1} + F(p_{i+1} + \cdots + p_{k-1}).$$

Therefore from (23) and (20)

$$p_i^T g_k = p_i^T g_{i+1} = 0 \quad \text{for } 0 \le i < k. \tag{27}$$

Hence from (24)

$$p_i^T F H_k g_k = 0. \tag{28}$$

Thus since $p_k = -\alpha_k H_k g_k$ and since $\alpha_k \ne 0$, we obtain

$$p_i^T F p_k = 0 \quad \text{for } i < k, \tag{29}$$

which proves (23) for k.
Now since from (24) for k - 1, (25) and (29)

$$q_k^T H_k F p_i = q_k^T p_i = p_k^T F p_i = 0, \quad 0 \le i < k,$$

we have

$$H_{k+1} F p_i = H_k F p_i = p_i, \quad 0 \le i < k.$$

This together with (26) proves (24) for k. ∎
Since the $p_k$'s are F-orthogonal and since we minimize f successively in these
directions, we see that the method is a conjugate direction method. Furthermore,
if the initial approximation $H_0$ is taken equal to the identity matrix, the method
becomes the conjugate gradient method. In any case the process obtains the overall
minimum point within n steps.

Finally, (24) shows that $p_0, p_1, p_2, \ldots, p_k$ are eigenvectors corresponding to
unity eigenvalue for the matrix $H_{k+1}F$. These eigenvectors are linearly independent,
since they are F-orthogonal, and therefore $H_n = F^{-1}$.
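
Relations (23) and (24) can be checked numerically on a small quadratic. The sketch below runs n = 3 DFP steps with exact line searches from a non-identity starting matrix; the data and the choice of $H_0$ are arbitrary assumptions made for illustration.

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T F x - b^T x (assumed data).
F = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
H = np.diag([1.0, 0.5, 2.0])          # any symmetric positive definite H_0
x = np.zeros(3)
g = F @ x - b
steps = []
for _ in range(3):
    d = -H @ g
    alpha = -(g @ d) / (d @ F @ d)    # exact line search for a quadratic
    p = alpha * d
    x = x + p
    g_next = F @ x - b
    q = g_next - g
    Hq = H @ q
    H = H + np.outer(p, p) / (p @ q) - np.outer(Hq, Hq) / (q @ Hq)   # DFP update (17)
    g = g_next
    steps.append(p)

P = np.column_stack(steps)
gram = P.T @ F @ P                                # entries p_i^T F p_j
off_diag = gram - np.diag(np.diag(gram))
print("largest |p_i^T F p_j|, i != j:", np.abs(off_diag).max())
print("H_n F =")
print(H @ F)
```

The off-diagonal entries of $P^TFP$ should vanish up to rounding, and $H_nF$ should be the identity, regardless of the positive definite $H_0$ chosen.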


10.4 THE BROYDEN FAMILY 


The updating formulae for the inverse Hessian considered in the previous two
sections are based on satisfying

$$H_{k+1} q_i = p_i, \quad 0 \le i \le k, \tag{30}$$

which is derived from the relation

$$q_i = F p_i, \quad 0 \le i \le k, \tag{31}$$

which would hold in the purely quadratic case. It is also possible to update
approximations to the Hessian F itself, rather than its inverse. Thus, denoting the kth
approximation of F by $B_k$, we would, analogously, seek to satisfy

$$q_i = B_{k+1} p_i, \quad 0 \le i \le k. \tag{32}$$

Equation (32) has exactly the same form as (30) except that $q_i$ and $p_i$ are
interchanged and H is replaced by B. It should be clear that this implies that
any update formula for H derived to satisfy (30) can be transformed into a
corresponding update formula for B. Specifically, given any update formula for H, the
complementary formula is found by interchanging the roles of B and H and of q
and p. Likewise, any updating formula for B that satisfies (32) can be converted
by the same process to a complementary formula for updating H. It is easily seen
that taking the complement of a complement restores the original formula.

To illustrate complementary formulae, consider the rank one update of
Section 10.2, which is

$$H_{k+1} = H_k + \frac{(p_k - H_k q_k)(p_k - H_k q_k)^T}{q_k^T (p_k - H_k q_k)}. \tag{33}$$

The corresponding complementary formula is

$$B_{k+1} = B_k + \frac{(q_k - B_k p_k)(q_k - B_k p_k)^T}{p_k^T (q_k - B_k p_k)}. \tag{34}$$

Likewise, the Davidon-Fletcher-Powell (or simply DFP) formula is

$$H_{k+1}^{DFP} = H_k + \frac{p_k p_k^T}{p_k^T q_k} - \frac{H_k q_k q_k^T H_k}{q_k^T H_k q_k}, \tag{35}$$

and its complement is

$$B_{k+1} = B_k + \frac{q_k q_k^T}{q_k^T p_k} - \frac{B_k p_k p_k^T B_k}{p_k^T B_k p_k}. \tag{36}$$

This last update is known as the Broyden-Fletcher-Goldfarb-Shanno update of $B_k$,
and it plays an important role in what follows.

Another way to convert an updating formula for H to one for B or vice versa
is to take the inverse. Clearly, if

$$H_{k+1} q_i = p_i, \quad 0 \le i \le k, \tag{37}$$

then

$$q_i = H_{k+1}^{-1} p_i, \quad 0 \le i \le k, \tag{38}$$

which implies that $H_{k+1}^{-1}$ satisfies (32), the criterion for an update of B. Also, most
importantly, the inverse of a rank two formula is itself a rank two formula.

The new formula can be found explicitly by two applications of the general
inversion identity (often referred to as the Sherman-Morrison formula)

$$[A + ab^T]^{-1} = A^{-1} - \frac{A^{-1} a b^T A^{-1}}{1 + b^T A^{-1} a}, \tag{39}$$

where A is an $n \times n$ matrix, and a and b are n-vectors, which is valid provided the
inverses exist. (This is easily verified by multiplying through by $A + ab^T$.)

The Broyden-Fletcher-Goldfarb-Shanno update for B produces, by taking the
inverse, a corresponding update for H of the form

$$H_{k+1}^{BFGS} = H_k + \left(1 + \frac{q_k^T H_k q_k}{q_k^T p_k}\right) \frac{p_k p_k^T}{p_k^T q_k} - \frac{p_k q_k^T H_k + H_k q_k p_k^T}{q_k^T p_k}. \tag{40}$$

This is an important update formula that can be used exactly like the DFP formula.
Numerical experiments have repeatedly indicated that its performance is superior
to that of the DFP formula, and for this reason it is now generally preferred.
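
The relationship between (36) and (40) can be confirmed numerically: updating B by (36) and inverting should give the same matrix as updating $H = B^{-1}$ by (40). The sketch below does this for one arbitrary pair (p, q) with $p^Tq > 0$; the function names and data are illustrative assumptions.

```python
import numpy as np

def bfgs_update_B(B, p, q):
    """BFGS update of the Hessian approximation, equation (36)."""
    Bp = B @ p
    return B + np.outer(q, q) / (q @ p) - np.outer(Bp, Bp) / (p @ Bp)

def bfgs_update_H(H, p, q):
    """BFGS update of the inverse Hessian approximation, equation (40)."""
    Hq = H @ q
    pq = p @ q
    return (H
            + (1.0 + (q @ Hq) / pq) * np.outer(p, p) / pq
            - (np.outer(p, Hq) + np.outer(Hq, p)) / pq)

# Illustrative data (assumed): symmetric positive definite B and a pair with p^T q > 0.
B = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
H = np.linalg.inv(B)
p = np.array([1.0, 0.0, 1.0])
q = np.array([1.0, 2.0, 0.0])

B_next = bfgs_update_B(B, p, q)
H_next = bfgs_update_H(H, p, q)
print("max |inv(B_next) - H_next|:", np.abs(np.linalg.inv(B_next) - H_next).max())
```

Working with (40) directly avoids forming or inverting $B_{k+1}$ at each iteration, which is why the H form is convenient in a descent procedure.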

It can be noted that both the DFP and the BFGS updates have symmetric
rank two corrections that are constructed from the vectors $p_k$ and $H_kq_k$. Weighted
combinations of these formulae will therefore also be of this same type (symmetric,
rank two, and constructed from $p_k$ and $H_kq_k$). This observation naturally leads
to consideration of a whole collection of updates, known as the Broyden family,
defined by

$$H^{\phi} = (1 - \phi) H^{DFP} + \phi H^{BFGS}, \tag{41}$$

where $\phi$ is a parameter that may take any real value. Clearly $\phi = 0$ and $\phi = 1$
yield the DFP and BFGS updates, respectively. The Broyden family also includes
the rank one update (see Exercise 12).

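An explicit representation of the Broyden family can be found, after a fair
amount of algebra, to be

$$H_{k+1}^{\phi} = H_k + \frac{p_k p_k^T}{p_k^T q_k} - \frac{H_k q_k q_k^T H_k}{q_k^T H_k q_k} + \phi v_k v_k^T = H_{k+1}^{DFP} + \phi v_k v_k^T, \tag{42}$$

where

$$v_k = (q_k^T H_k q_k)^{1/2} \left( \frac{p_k}{p_k^T q_k} - \frac{H_k q_k}{q_k^T H_k q_k} \right).$$

This form will be useful in some later developments.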
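
The representation (42) translates directly into code. The following sketch implements one Broyden-family update and compares the case $\phi = 1$ against the BFGS formula (40); the function names and the test data are assumptions made for this illustration.

```python
import numpy as np

def broyden_update(H, p, q, phi):
    """Broyden family update in the form (42): DFP correction plus phi * v v^T."""
    Hq = H @ q
    pq, qHq = p @ q, q @ Hq
    H_dfp = H + np.outer(p, p) / pq - np.outer(Hq, Hq) / qHq
    v = np.sqrt(qHq) * (p / pq - Hq / qHq)
    return H_dfp + phi * np.outer(v, v)

def bfgs_update_H(H, p, q):
    """BFGS update written as in (40), used here only for comparison."""
    Hq = H @ q
    pq = p @ q
    return (H + (1.0 + (q @ Hq) / pq) * np.outer(p, p) / pq
              - (np.outer(p, Hq) + np.outer(Hq, p)) / pq)

# Illustrative data (assumed): positive definite H and a pair with p^T q > 0.
H = np.diag([1.0, 0.5, 2.0])
p = np.array([1.0, 0.0, 1.0])
q = np.array([1.0, 2.0, 0.0])

gap_to_bfgs = np.abs(broyden_update(H, p, q, 1.0) - bfgs_update_H(H, p, q)).max()
secant_gaps = [np.abs(broyden_update(H, p, q, phi) @ q - p).max()
               for phi in (-0.5, 0.0, 0.3, 1.0, 2.0)]
print("phi = 1 versus (40):", gap_to_bfgs)
print("residuals of H q = p for several phi:", secant_gaps)
```

Because $v_k^Tq_k = 0$, the term $\phi v_kv_k^T$ does not disturb the relation $H_{k+1}q_k = p_k$, so the residuals should be at rounding level for every $\phi$ tried.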

A Broyden method is defined as a quasi-Newton method in which at each
iteration a member of the Broyden family is used as the updating formula. The
parameter $\phi$ is, in general, allowed to vary from one iteration to another, so a
particular Broyden method is defined by a sequence $\phi_1, \phi_2, \ldots$ of parameter values.
A pure Broyden method is one that uses a constant $\phi$.

Since both $H^{DFP}$ and $H^{BFGS}$ satisfy the fundamental relation (30) for updates,
this relation is also satisfied by all members of the Broyden family. Thus it can
be expected that many properties that were found to hold for the DFP method will
also hold for any Broyden method, and indeed this is so. The following is a direct
extension of the theorem of Section 10.3.


Theorem. If f is quadratic with positive definite Hessian F, then for a Broyden
method

$$p_i^T F p_j = 0, \quad 0 \le i < j \le k$$

$$H_{k+1} F p_i = p_i \quad \text{for } 0 \le i \le k.$$


Proof. The proof parallels that of Section 10.3, since the results depend only on
the basic relation (30) and the orthogonality (20) because of exact line search. ∎


The Broyden family does not necessarily preserve positive definiteness of $H^{\phi}$
for all values of $\phi$. However, we know that the DFP method does preserve positive
definiteness. Hence from (42) it follows that positive definiteness is preserved for
any $\phi \ge 0$, since the sum of a positive definite matrix and a positive semidefinite
matrix is positive definite. For $\phi < 0$ there is the possibility that $H^{\phi}$ may become
singular, and thus special precautions should be introduced. In practice $\phi \ge 0$ is
usually imposed to avoid difficulties.

There has been considerable experimentation with Broyden methods to
determine superior strategies for selecting the sequence of parameters $\phi_k$.
The above theorem shows that the choice is irrelevant in the case of a quadratic
objective and accurate line search. More surprisingly, it has been shown that
even for the case of nonquadratic functions and accurate line searches, the points
generated by all Broyden methods will coincide (provided singularities are avoided
and multiple minima are resolved consistently). This means that differences in
methods are important only with inaccurate line search.

For general nonquadratic functions of modest dimension, Broyden methods
seem to offer a combination of advantages as attractive general procedures. First,
they require only that first-order (that is, gradient) information be available. Second,
the directions generated can always be guaranteed to be directions of descent by
arranging for $H_k$ to be positive definite throughout the process. Third, since for a
quadratic problem the matrices $H_k$ converge to the inverse Hessian in at most n
steps, it might be argued that in the general case $H_k$ will converge to the inverse
Hessian at the solution, and hence convergence will be superlinear. Unfortunately,
while the methods are certainly excellent, their convergence characteristics require
more careful analysis, and this will lead us to an important additional modification.


Partial Quasi-Newton Methods 


There is, of course, the option of restarting a Broyden method every m + 1 steps,
where m + 1 < n. This would yield a partial quasi-Newton method that, for small
values of m, would have modest storage requirements, since the approximate inverse
Hessian could be stored implicitly by storing only the vectors $p_i$ and $q_i$, $i \le m + 1$. In
the quadratic case this method exactly corresponds to the partial conjugate gradient
method and hence it has similar convergence properties.


10.5 CONVERGENCE PROPERTIES


The various schemes for simultaneously generating and using an approximation 
to the inverse Hessian are difficult to analyze definitively. One must therefore, to 
some extent, resort to the use of analogy and approximate analyses to determine 
their effectiveness. Nevertheless, the machinery we developed earlier provides a 
basis for at least a preliminary analysis. 


Global Convergence 


In practice, quasi-Newton methods are usually executed in a continuing fashion, 
starting with an initial approximation and successively improving it throughout the 
iterative process. Under various and somewhat stringent conditions, it can be proved 
that this procedure is globally convergent. If, on the other hand, the quasi-Newton 
methods are restarted every n or n+ 1 steps by resetting the approximate inverse 
Hessian to its initial value, then global convergence is guaranteed by the presence 
of the first descent step of each cycle (which acts as a spacer step). 
Local Convergence


The local convergence properties of quasi-Newton methods in the pure form 
discussed so far are not as good as might first be thought. Let us focus on the 
local convergence properties of these methods when executed with the restarting 
feature. Specifically, consider a Broyden method and for simplicity assume that at 
the beginning of each cycle the approximate inverse Hessian is reset to the identity 
matrix. Each cycle, if at least n steps in duration, will then contain one complete 
cycle of an approximation to the conjugate gradient method. Asymptotically, in 
the tail of the generated sequence, this approximation becomes arbitrarily accurate, 
and hence we may conclude, as for any method that asymptotically approaches 
the conjugate gradient method, that the method converges superlinearly (at least if 
viewed at the end of each cycle). Although superlinear convergence is attractive, 
the fact that in this case it hinges on repeated cycles of n steps in duration can 
seriously detract from its practical significance for problems with large n, since we 
might hope to terminate the procedure before completing even a single full cycle 
of n steps. 

To obtain insight into the defects of the method, let us consider a special
situation. Suppose that f is quadratic and that the eigenvalues of the Hessian, F,
of f are close together but all very large. If, starting with the identity matrix, an
approximation to the inverse Hessian is updated m times, the matrix $H_mF$ will
have m eigenvalues equal to unity and the rest will still be large. Thus, the ratio
of smallest to largest eigenvalue of $H_mF$, the condition number, will be worse
than for F itself. Therefore, if the updating were discontinued and $H_m$ were used
as the approximation to $F^{-1}$ in future iterations according to the procedure of
Section 10.1, we see that convergence would be poorer than it would be for ordinary
steepest descent. In other words, the approximations to $F^{-1}$ generated by the
updating formulas, although accurate over the subspace traveled, do not necessarily
improve and, indeed, are likely to worsen the eigenvalue structure of the iteration
process.
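
This effect is easy to observe. The sketch below applies m = 2 DFP updates, starting from the identity, to a quadratic whose Hessian has eigenvalues clustered between 30 and 40, and then inspects the eigenvalues of $H_mF$. The specific matrix, starting point, and value of m are assumptions chosen for illustration. Since F is diagonal here, the eigenvalues are computed from the symmetric matrix $F^{1/2}H_mF^{1/2}$, which is similar to $H_mF$.

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T F x with large, clustered eigenvalues (assumed data).
F = np.diag([40.0, 38.0, 36.0, 34.0, 32.0, 30.0])
x = np.full(6, 10.0)
H = np.eye(6)                          # start from the identity matrix
g = F @ x
m = 2                                  # number of DFP updates performed
for _ in range(m):
    d = -H @ g
    alpha = -(g @ d) / (d @ F @ d)     # exact line search
    p = alpha * d
    x = x + p
    g_next = F @ x
    q = g_next - g
    Hq = H @ q
    H = H + np.outer(p, p) / (p @ q) - np.outer(Hq, Hq) / (q @ Hq)
    g = g_next

F_half = np.sqrt(F)                    # elementwise square root is valid for diagonal F
eig = np.linalg.eigvalsh(F_half @ H @ F_half)    # same eigenvalues as H_m F
cond_F = 40.0 / 30.0
cond_HF = eig.max() / eig.min()
print("eigenvalues of H_m F:", eig)
print("condition number of F:", cond_F, " of H_m F:", cond_HF)
```

Two eigenvalues sit at unity while the others remain near the original cluster, so the spread of $H_mF$ is far wider than that of F itself.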

In practice a poor eigenvalue structure arising in this manner will play a 
dominating role whenever there are factors that tend to weaken its approximation 
to the conjugate gradient method. Common factors of this type are round-off errors, 
inaccurate line searches, and nonquadratic terms in the objective function. Indeed, 
it has been frequently observed, empirically, that performance of the DFP method 
is highly sensitive to the accuracy of the line search algorithm—to the point where 
superior step-wise convergence properties can only be obtained through excessive 
time expenditure in the line search phase. 


Example. To illustrate some of these conclusions we consider the six-dimensional 
problem defined by 


\[ f(x) = \tfrac{1}{2} x^T Q x, \]

where

\[ Q = \begin{bmatrix}
40 & 0 & 0 & 0 & 0 & 0 \\
0 & 38 & 0 & 0 & 0 & 0 \\
0 & 0 & 36 & 0 & 0 & 0 \\
0 & 0 & 0 & 34 & 0 & 0 \\
0 & 0 & 0 & 0 & 32 & 0 \\
0 & 0 & 0 & 0 & 0 & 30
\end{bmatrix}. \]


This function was minimized iteratively (the solution is obviously $x^* = 0$) starting
at $x_0 = (10, 10, 10, 10, 10, 10)$, with $f(x_0) = 10{,}500$, by using, alternatively, the
method of steepest descent, the DFP method, the DFP method restarted every six 
steps, and the self-scaling method described in the next section. For this quadratic 
problem the appropriate step size to take at any stage can be calculated by a simple 
formula. On different computer runs of a given method, different levels of error 
were deliberately introduced into the step size in order to observe the effect of line 
search accuracy. This error took the form of a fixed percentage increase over the 
optimal value. The results are presented below: 


Case 1. No error in step size α

                                 Function value

Iteration   Steepest descent    DFP                 DFP (with restart)   Self-scaling
1           96.29630            96.29630            96.29630             96.29630
2           1.560669            6.900839 x 10^-1    6.900839 x 10^-1     6.900839 x 10^-1
3           2.932559 x 10^-2    3.988497 x 10^-3    3.988497 x 10^-3     3.988497 x 10^-3
4           5.787315 x 10^-4    1.683310 x 10^-5    1.683310 x 10^-5     1.683310 x 10^-5
5           1.164595 x 10^-5    3.878639 x 10^-8    3.878639 x 10^-8     3.878639 x 10^-8
6           2.359563 x 10^-7


Case 2. 0.1% error in step size α

                                 Function value

Iteration   Steepest descent    DFP                 DFP (with restart)   Self-scaling
1           96.30669            96.30669            96.30669             96.30669
2           1.564971            6.994023 x 10^-1    6.994023 x 10^-1     6.902072 x 10^-1
3           2.939804 x 10^-2    1.225501 x 10^-2    1.225501 x 10^-2     3.989507 x 10^-3
4           5.810123 x 10^-4    7.301088 x 10^-3    7.301088 x 10^-3     1.684263 x 10^-5
5           1.169205 x 10^-5    2.636716 x 10^-3    2.636716 x 10^-3     3.881674 x 10^-8
6           2.372385 x 10^-7    1.031086 x 10^-5    1.031086 x 10^-5
7                               3.633330 x 10^-9    2.399278 x 10^-8


Case 3. 1% error in step size α

                                 Function value

Iteration   Steepest descent    DFP                 DFP (with restart)   Self-scaling
1           97.33665            97.33665            97.33665             97.33665
2           1.586251            1.621908            1.621908             0.7024872
3           2.989875 x 10^-2    8.268893 x 10^-1    8.268893 x 10^-1     4.090350 x 10^-3
4           5.908101 x 10^-4    4.302943 x 10^-1    4.302943 x 10^-1     1.779424 x 10^-5
5           1.194144 x 10^-5    4.449852 x 10^-3    4.449852 x 10^-3     4.195668 x 10^-8
6           2.422985 x 10^-7    5.337835 x 10^-5    5.337835 x 10^-5
7                               3.767830 x 10^-5    4.493397 x 10^-7
8                               3.768097 x 10^-9


Case 4. 10% error in step size α

                                 Function value

Iteration   Steepest descent    DFP                 DFP (with restart)   Self-scaling
1           200.333             200.333             200.333              200.333
2           2.732789            93.65457            93.65457             2.811061
3           3.836899 x 10^-2    56.92999            56.92999             3.562769 x 10^-2
4           6.376461 x 10^-4    1.620688            1.620688             4.200600 x 10^-4
5           1.219515 x 10^-5    5.251115 x 10^-1    5.251115 x 10^-1     4.726918 x 10^-6
6           2.457944 x 10^-7    3.323745 x 10^-1    3.323745 x 10^-1
7                               6.150890 x 10^-3    8.102700 x 10^-3
8                               3.025393 x 10^-3    2.973021 x 10^-3
9                               3.025476 x 10^-5    1.950152 x 10^-3
10                              3.025476 x 10^-7    2.769299 x 10^-5
11                                                  1.760320 x 10^-5
12                                                  1.123844 x 10^-6


We note first that the error introduced is reported as a percentage of the step 
size itself. In terms of the change in function value, the quantity that is most often 
monitored to determine when to terminate a line search, the fractional error is the 
square of that in the step size. Thus, a one percent error in step size is equivalent 
to a 0.01% error in the change in function value. 
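
The squared relationship can be checked directly when f is quadratic along the search line. The short derivation below is a sketch; the symbols $\phi$, $c$, and $\varepsilon$ (the relative step-size error) are introduced here only for illustration.

```latex
% Restriction of a quadratic f to the search line, with exact minimizer \alpha^*:
\phi(\alpha) = f(x_k + \alpha d_k)
             = \phi(\alpha^*) + \tfrac{1}{2}\, c\, (\alpha - \alpha^*)^2,
\qquad c = d_k^T Q d_k > 0 .

% Decrease obtained with the exact step:
\phi(0) - \phi(\alpha^*) = \tfrac{1}{2}\, c\, (\alpha^*)^2 .

% Decrease obtained with the step (1 + \varepsilon)\alpha^*:
\phi(0) - \phi\big((1+\varepsilon)\alpha^*\big)
   = \tfrac{1}{2}\, c\, (\alpha^*)^2 \,(1 - \varepsilon^2) .
```

The fraction of the attainable decrease that is lost is therefore $\varepsilon^2$; with $\varepsilon = 0.01$ this is $10^{-4}$, that is, 0.01%.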

Next we note that the method of steepest descent is not radically affected by an 
inaccurate line search while the DFP methods are. Thus for this example while DFP 
is superior to steepest descent in the case of perfect accuracy, it becomes inferior 
at an error of only 0.1% in step size. 
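
The experiment can be sketched in a few lines of code. The following is a minimal illustration, not the program used to produce the tables: it assumes the step-size error is applied as `alpha = (1 + err) * alpha_exact`, and the names `run`, `dfp_update`, `err`, and `iters` are hypothetical. It covers only steepest descent and DFP without restart.

```python
# Sketch: steepest descent vs. DFP on f(x) = 0.5 x^T Q x with an inflated step.
# All names here are illustrative; Q and x_0 are taken from the example.

Q_DIAG = [40.0, 38.0, 36.0, 34.0, 32.0, 30.0]
N = len(Q_DIAG)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def f(x):
    return 0.5 * sum(q * xi * xi for q, xi in zip(Q_DIAG, x))

def grad(x):
    return [q * xi for q, xi in zip(Q_DIAG, x)]

def dfp_update(H, p, q):
    """DFP formula: H - H q q^T H / (q^T H q) + p p^T / (p^T q)."""
    Hq = matvec(H, q)
    qHq = dot(q, Hq)
    pq = dot(p, q)
    return [[H[i][j] - Hq[i] * Hq[j] / qHq + p[i] * p[j] / pq
             for j in range(N)] for i in range(N)]

def run(method, err, iters):
    """Return [f(x_1), ..., f(x_iters)] for method 'sd' or 'dfp'."""
    x = [10.0] * N
    H = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
    values = []
    for _ in range(iters):
        g = grad(x)
        if method == "sd":
            d = [-gi for gi in g]
        else:
            d = [-v for v in matvec(H, g)]
        Qd = [qd * di for qd, di in zip(Q_DIAG, d)]
        alpha_exact = -dot(g, d) / dot(d, Qd)   # minimizer along d for a quadratic
        alpha = (1.0 + err) * alpha_exact       # assumed form of the step error
        p = [alpha * di for di in d]
        x = [xi + pi for xi, pi in zip(x, p)]
        values.append(f(x))
        if method == "dfp":
            q = [a - b for a, b in zip(grad(x), g)]
            H = dfp_update(H, p, q)
    return values
```

Calling `run("sd", 0.0, 6)` and `run("dfp", 0.10, 6)` gives sequences that can be compared with the corresponding columns of the tables; the first exact step is common to both methods because $H_0 = I$.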


10.6 SCALING 


There is a general viewpoint about what makes up a desirable descent method that 
underlies much of our earlier discussions and which we now summarize briefly in 
order to motivate the presentation of scaling. A method that converges to the exact 
solution after n steps when applied to a quadratic function on $E^n$ has obvious appeal
especially if, as is usually the case, it can be inferred that for nonquadratic problems 
repeated cycles of length n of the method will yield superlinear convergence. For 
problems having large n, however, a more sophisticated criterion of performance 
needs to be established, since for such problems one usually hopes to be able to 
terminate the descent process before completing even a single full cycle of length 
n. Thus, with these sorts of problems in mind, the finite-step convergence property 
serves at best only as a signpost indicating that the algorithm might make rapid
progress in its early stages. It is essential to ensure that in fact it will make rapid
progress at every stage. Furthermore, the rapid convergence at each step must not 
be tied to an assumption on conjugate directions, a property easily destroyed by 
inaccurate line search and nonquadratic objective functions. With this viewpoint it 
is natural to look for quasi-Newton methods that simultaneously possess favorable 
eigenvalue structure at each step (in the sense of Section 10.1) and reduce to the 
conjugate gradient method if the objective function happens to be quadratic. Such 
methods are developed in this section. 


Improvement of Eigenvalue Ratio 


Referring to the example presented in the last section where the Davidon-Fletcher-Powell
method performed poorly, we can trace the difficulty to the simple observation
that the eigenvalues of $H_0Q$ are all much larger than unity. The DFP
algorithm, or any Broyden method, essentially moves these eigenvalues, one at a
time, to unity thereby producing an unfavorable eigenvalue ratio in each $H_kQ$ for
$1 \le k < n$. This phenomenon can be attributed to the fact that the methods are
sensitive to simple scale factors. In particular if $H_0$ were multiplied by a constant,
the whole process would be different. In the example of the last section, if $H_0$ were
scaled by, for instance, multiplying it by 1/35, the eigenvalues of $H_0Q$ would be
spread above and below unity, and in that case one might suspect that the poor
performance would not show up.
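
As a quick worked check of this remark, take $Q = \operatorname{diag}(40, 38, 36, 34, 32, 30)$ from that example and $H_0 = \tfrac{1}{35} I$ (the factor suggested above). The eigenvalues of $H_0Q$ are then simply the diagonal entries divided by 35:

```latex
\tfrac{40}{35} \approx 1.143,\quad
\tfrac{38}{35} \approx 1.086,\quad
\tfrac{36}{35} \approx 1.029,\quad
\tfrac{34}{35} \approx 0.971,\quad
\tfrac{32}{35} \approx 0.914,\quad
\tfrac{30}{35} \approx 0.857 .
```

Three eigenvalues lie above unity and three below, so moving any one of them to unity cannot enlarge the interval they span.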

Motivated by the above considerations, we shall establish conditions under
which the eigenvalue ratio of $H_{k+1}F$ is at least as favorable as that of $H_kF$ in a
Broyden method. These conditions will then be used as a basis for introducing
appropriate scale factors.

We use (but do not prove) the following matrix theoretic result due to Loewner. 


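Interlocking Eigenvalues Lemma. Let the symmetric $n \times n$ matrix $A$ have
eigenvalues $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$. Let $a$ be any vector in $E^n$ and denote the
eigenvalues of the matrix $A + aa^T$ by $\mu_1 \le \mu_2 \le \dots \le \mu_n$. Then
$\lambda_1 \le \mu_1 \le \lambda_2 \le \mu_2 \le \dots \le \lambda_n \le \mu_n$.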


For convenience we introduce the following definitions:

\[ R_k = F^{1/2} H_k F^{1/2}, \qquad r_k = F^{1/2} p_k. \]

Then using $q_k = F^{1/2} r_k$, it can be readily verified that (42) is equivalent to

\[ R_{k+1}^{\phi} = R_k - \frac{R_k r_k r_k^T R_k}{r_k^T R_k r_k} + \frac{r_k r_k^T}{r_k^T r_k} + \phi z_k z_k^T, \tag{43} \]

where

\[ z_k = F^{1/2} v_k = \sqrt{r_k^T R_k r_k}\, \left( \frac{r_k}{r_k^T r_k} - \frac{R_k r_k}{r_k^T R_k r_k} \right). \]

Since $R_k$ is similar to $H_kF$ (because $H_kF = F^{-1/2} R_k F^{1/2}$), both have the same
eigenvalues. It is most convenient, however, in view of (43) to study $R_k$, obtaining
conclusions about $H_kF$ indirectly.

Before proving the general theorem we shall consider the case $\phi = 0$ corresponding
to the DFP formula. Suppose the eigenvalues of $R_k$ are $\lambda_1, \lambda_2, \dots, \lambda_n$
with $0 < \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$. Suppose also that $1 \in [\lambda_1, \lambda_n]$. We will show that
the eigenvalues of $R_{k+1}$ are all contained in the interval $[\lambda_1, \lambda_n]$, which of course
implies that $R_{k+1}$ is no worse than $R_k$ in terms of its condition number. Let us first
consider the matrix

\[ P = R_k - \frac{R_k r_k r_k^T R_k}{r_k^T R_k r_k}. \]

We see that $Pr_k = 0$ so one eigenvalue of $P$ is zero. If we denote the eigenvalues
of $P$ by $\mu_1 \le \mu_2 \le \dots \le \mu_n$, we have from the above observation and the lemma
on interlocking eigenvalues that

\[ 0 = \mu_1 \le \lambda_1 \le \mu_2 \le \lambda_2 \le \dots \le \mu_n \le \lambda_n. \]

Next we consider

\[ R_{k+1} = R_k - \frac{R_k r_k r_k^T R_k}{r_k^T R_k r_k} + \frac{r_k r_k^T}{r_k^T r_k} = P + \frac{r_k r_k^T}{r_k^T r_k}. \tag{44} \]

Since $r_k$ is an eigenvector of $P$ and since, by symmetry, all other eigenvectors of
$P$ are therefore orthogonal to $r_k$, it follows that the only eigenvalue different in
$R_{k+1}$ from in $P$ is the one corresponding to $r_k$, which is now unity. Thus $R_{k+1}$
has eigenvalues $\mu_2, \mu_3, \dots, \mu_n$ and unity. These are all contained in the interval
$[\lambda_1, \lambda_n]$. Thus updating does not worsen the eigenvalue ratio. It should be noted
that this result in no way depends on $\alpha_k$ being selected to minimize $f$.
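
The DFP result can be checked numerically on a small case. The sketch below applies the transformation (44) to a 2 x 2 matrix whose eigenvalues straddle unity; the matrix `R`, the vector `r`, and the helper names are arbitrary choices made for illustration only.

```python
import math

def eig2(M):
    """Eigenvalues (ascending) of a symmetric 2x2 matrix, in closed form."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    mean = 0.5 * (a + d)
    rad = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return mean - rad, mean + rad

def dfp_transform(R, r):
    """R - R r r^T R / (r^T R r) + r r^T / (r^T r), as in (44)."""
    Rr = [R[0][0] * r[0] + R[0][1] * r[1],
          R[1][0] * r[0] + R[1][1] * r[1]]
    rRr = r[0] * Rr[0] + r[1] * Rr[1]
    rr = r[0] * r[0] + r[1] * r[1]
    return [[R[i][j] - Rr[i] * Rr[j] / rRr + r[i] * r[j] / rr
             for j in range(2)] for i in range(2)]

# Arbitrary symmetric positive definite R with eigenvalues on both sides of 1.
R = [[2.0, 0.5], [0.5, 0.6]]
r = [1.0, -2.0]

lo, hi = eig2(R)
new_lo, new_hi = eig2(dfp_transform(R, r))
print("before:", lo, hi)
print("after: ", new_lo, new_hi)
```

One of the new eigenvalues equals unity (its eigenvector is `r`), and the other stays inside the original interval, as the argument above predicts.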

We now extend the above to the Broyden class with $0 \le \phi \le 1$.


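Theorem. Let the n eigenvalues of $H_kF$ be $\lambda_1, \lambda_2, \dots, \lambda_n$ with
$0 < \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$. Suppose that $1 \in [\lambda_1, \lambda_n]$. Then for any $\phi$, $0 \le \phi \le 1$,
the eigenvalues of $H_{k+1}^{\phi}F$, where $H_{k+1}^{\phi}$ is defined by (42), are all contained in $[\lambda_1, \lambda_n]$.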
Proof. The result shown above corresponds to $\phi = 0$. Let us now consider $\phi = 1$,
corresponding to the BFGS formula. By our original definition of the BFGS update,
$H^{-1}$ is defined by the formula that is complementary to the DFP formula. Thus

\[ H_{k+1}^{-1} = H_k^{-1} + \frac{q_k q_k^T}{q_k^T p_k} - \frac{H_k^{-1} p_k p_k^T H_k^{-1}}{p_k^T H_k^{-1} p_k}. \]

This is equivalent to

\[ R_{k+1}^{-1} = R_k^{-1} - \frac{R_k^{-1} r_k r_k^T R_k^{-1}}{r_k^T R_k^{-1} r_k} + \frac{r_k r_k^T}{r_k^T r_k}, \tag{45} \]

which is identical to (44) except that $R_k$ is replaced by $R_k^{-1}$.

The eigenvalues of $R_k^{-1}$ are $1/\lambda_n \le 1/\lambda_{n-1} \le \dots \le 1/\lambda_1$. Clearly,
$1 \in [1/\lambda_n, 1/\lambda_1]$. Thus by the preliminary result, if the eigenvalues of $R_{k+1}^{-1}$ are denoted
$1/\mu_n \le 1/\mu_{n-1} \le \dots \le 1/\mu_1$, it follows that they are contained in the interval
$[1/\lambda_n, 1/\lambda_1]$. Thus $1/\lambda_n \le 1/\mu_n$ and $1/\lambda_1 \ge 1/\mu_1$. When inverted this yields
$\mu_1 \ge \lambda_1$ and $\mu_n \le \lambda_n$, which shows that the eigenvalues of $R_{k+1}$ are contained in $[\lambda_1, \lambda_n]$.
This establishes the result for $\phi = 1$.

For general $\phi$ the matrix $R_{k+1}^{\phi}$ defined by (43) has eigenvalues that are all
monotonically increasing with $\phi$ (as can be seen from the interlocking eigenvalues
lemma). However, from above it is known that these eigenvalues are contained in
$[\lambda_1, \lambda_n]$ for $\phi = 0$ and $\phi = 1$. Hence, they must be contained in $[\lambda_1, \lambda_n]$ for all
$\phi$, $0 \le \phi \le 1$. ∎


Scale Factors 


In view of the result derived above, it is clearly advantageous to scale the matrix $H_k$
so that the eigenvalues of $H_kF$ are spread both below and above unity. Of course
in the ideal case of a quadratic problem with perfect line search this is strictly only
necessary for $H_0$, since unity is an eigenvalue of $H_kF$ for $k > 0$. But because of
the inescapable deviations from the ideal, it is useful to consider the possibility of
scaling every $H_k$.

A scale factor can be incorporated directly into the updating formula. We first
multiply $H_k$ by the scale factor $\gamma_k$ and then apply the usual updating formula. This
is equivalent to replacing $H_k$ by $\gamma_kH_k$ in (43) and leads to

\[ H_{k+1} = \left( H_k - \frac{H_k q_k q_k^T H_k}{q_k^T H_k q_k} + \phi v_k v_k^T \right) \gamma_k + \frac{p_k p_k^T}{p_k^T q_k}. \tag{46} \]

This defines a two-parameter family of updates that reduces to the Broyden family
for $\gamma_k = 1$.

Using $\gamma_0, \gamma_1, \dots$ as arbitrary positive scale factors, we consider the algorithm:
Start with any symmetric positive definite matrix $H_0$ and any point $x_0$, then starting
with $k = 0$,

Step 1. Set $d_k = -H_k g_k$.

Step 2. Minimize $f(x_k + \alpha d_k)$ with respect to $\alpha \ge 0$ to obtain $x_{k+1}$, $p_k = \alpha_k d_k$,
and $g_{k+1}$.

Step 3. Set $q_k = g_{k+1} - g_k$ and

\[ H_{k+1} = \left( H_k - \frac{H_k q_k q_k^T H_k}{q_k^T H_k q_k} + \phi v_k v_k^T \right) \gamma_k + \frac{p_k p_k^T}{p_k^T q_k}, \]

\[ v_k = (q_k^T H_k q_k)^{1/2} \left( \frac{p_k}{p_k^T q_k} - \frac{H_k q_k}{q_k^T H_k q_k} \right). \tag{47} \]


The use of scale factors does destroy the property $H_n = F^{-1}$ in the quadratic case,
but it does not destroy the conjugate direction property. The following properties of
this method can be proved as simple extensions of the results given in Section 10.3.


1. If $H_k$ is positive definite and $p_k^T q_k > 0$, (47) yields an $H_{k+1}$ that is positive
definite.


2. If $f$ is quadratic with Hessian $F$, then the vectors $p_0, p_1, \dots, p_{n-1}$ are mutually
$F$-orthogonal, and, for each $k$, the vectors $p_0, p_1, \dots, p_k$ are eigenvectors of
$H_{k+1}F$.


We can conclude that scale factors do not destroy the underlying conjugate 
behavior of the algorithm. Hence we can use scaling to ensure good single-step 
convergence properties. 


A Self-Scaling Quasi-Newton Algorithm 


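The question that arises next is how to select appropriate scale factors. If
$\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ are the eigenvalues of $H_kF$, we want to multiply $H_k$ by $\gamma_k$ where
$\lambda_1 \le 1/\gamma_k \le \lambda_n$. This will ensure that the new eigenvalues contain unity in the
interval they span.

Note that in terms of our earlier notation

\[ \frac{q_k^T H_k q_k}{p_k^T q_k} = \frac{r_k^T R_k r_k}{r_k^T r_k}. \]

Recalling that $R_k$ has the same eigenvalues as $H_kF$ and noting that for any $r_k$,

\[ \lambda_1 \le \frac{r_k^T R_k r_k}{r_k^T r_k} \le \lambda_n, \]

we see that

\[ \gamma_k = \frac{p_k^T q_k}{q_k^T H_k q_k} \tag{48} \]

serves as a suitable scale factor.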
We now state a complete self-scaling, restarting, quasi-Newton method based
on the ideas above. For simplicity we take $\phi = 0$ and thus obtain a modification of
the DFP method. Start at any point $x_0$, $k = 0$.


Step 1. Set $H_k = I$.

Step 2. Set $d_k = -H_k g_k$.

Step 3. Minimize $f(x_k + \alpha d_k)$ with respect to $\alpha \ge 0$ to obtain $\alpha_k$, $x_{k+1}$, $p_k = \alpha_k d_k$,
$g_{k+1}$, and $q_k = g_{k+1} - g_k$. (Select $\alpha_k$ accurately enough to ensure $p_k^T q_k > 0$.)

Step 4. If $k$ is not an integer multiple of $n$, set

\[ H_{k+1} = \left( H_k - \frac{H_k q_k q_k^T H_k}{q_k^T H_k q_k} \right) \frac{p_k^T q_k}{q_k^T H_k q_k} + \frac{p_k p_k^T}{p_k^T q_k}. \tag{49} \]

Add one to $k$ and return to Step 2. If $k$ is an integer multiple of $n$, return to
Step 1.

This algorithm was run, with various amounts of inaccuracy introduced in the 
line search, on the quadratic problem presented in Section 10.4. The results are 
presented in that section. 
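
The steps above can be sketched as follows for the quadratic test problem $f(x) = \tfrac{1}{2}x^TQx$ with $Q = \operatorname{diag}(40, 38, 36, 34, 32, 30)$ and $x_0 = (10, \dots, 10)$. This is an illustrative sketch under stated assumptions, not the original program: the step error is modeled as `alpha = (1 + err) * alpha_exact`, a restart to the identity is made every n steps, and the names `minimize`, `update`, and `self_scale` are hypothetical. Setting `self_scale=False` gives the restarted DFP method for comparison.

```python
# Sketch of the self-scaling, restarting DFP method (phi = 0).
# All names are illustrative.

Q_DIAG = [40.0, 38.0, 36.0, 34.0, 32.0, 30.0]
N = len(Q_DIAG)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def f(x):
    return 0.5 * sum(q * xi * xi for q, xi in zip(Q_DIAG, x))

def grad(x):
    return [q * xi for q, xi in zip(Q_DIAG, x)]

def update(H, p, q, self_scale):
    """Scaled DFP update: (H - H q q^T H / q^T H q) * gamma + p p^T / p^T q."""
    Hq = matvec(H, q)
    qHq = dot(q, Hq)
    pq = dot(p, q)                           # positive for this quadratic
    gamma = pq / qHq if self_scale else 1.0  # scale factor p^T q / (q^T H q)
    return [[(H[i][j] - Hq[i] * Hq[j] / qHq) * gamma + p[i] * p[j] / pq
             for j in range(N)] for i in range(N)]

def minimize(err, iters, self_scale=True):
    """Return [f(x_1), ..., f(x_iters)]."""
    x = [10.0] * N
    H = None
    values = []
    for k in range(iters):
        if k % N == 0:                       # restart with H = I every n steps
            H = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
        g = grad(x)
        d = [-v for v in matvec(H, g)]
        Qd = [qd * di for qd, di in zip(Q_DIAG, d)]
        alpha = (1.0 + err) * (-dot(g, d) / dot(d, Qd))
        p = [alpha * di for di in d]
        x = [xi + pi for xi, pi in zip(x, p)]
        values.append(f(x))
        q = [a - b for a, b in zip(grad(x), g)]
        H = update(H, p, q, self_scale)
    return values
```

Comparing `minimize(0.10, 5, True)` with `minimize(0.10, 5, False)` shows the effect of the scale factor under an inaccurate line search; with `err = 0.0` the two variants generate the same iterates, since scaling leaves the conjugate direction property intact.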


10.7 MEMORYLESS QUASI-NEWTON METHODS 


The preceding development of quasi-Newton methods can be used as a basis for 
reconsideration of conjugate gradient methods. The result is an attractive class of 
new procedures. 

Consider a simplification of the BFGS quasi-Newton method where $H_{k+1}$ is
defined by a BFGS update applied to $H = I$, rather than to $H_k$. Thus $H_{k+1}$ is
determined without reference to the previous $H_k$, and hence the update procedure
is memoryless. This update procedure leads to the following algorithm: Start at any
point $x_0$, $k = 0$.


Step 1. Set $H_k = I$. (50)

Step 2. Set $d_k = -H_k g_k$. (51)

Step 3. Minimize $f(x_k + \alpha d_k)$ with respect to $\alpha \ge 0$ to obtain $\alpha_k$, $x_{k+1}$,
$p_k = \alpha_k d_k$, $g_{k+1}$, and $q_k = g_{k+1} - g_k$. (Select $\alpha_k$ accurately enough to ensure $p_k^T q_k > 0$.)

Step 4. If $k$ is not an integer multiple of $n$, set

\[ H_{k+1} = I - \frac{q_k p_k^T + p_k q_k^T}{p_k^T q_k} + \left( 1 + \frac{q_k^T q_k}{p_k^T q_k} \right) \frac{p_k p_k^T}{p_k^T q_k}. \tag{52} \]

Add 1 to $k$ and return to Step 2. If $k$ is an integer multiple of $n$, return to
Step 1.
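Combining (51) and (52), it is easily seen that

\[ d_{k+1} = -g_{k+1} + \frac{q_k p_k^T g_{k+1} + p_k q_k^T g_{k+1}}{p_k^T q_k} - \left( 1 + \frac{q_k^T q_k}{p_k^T q_k} \right) \frac{p_k p_k^T g_{k+1}}{p_k^T q_k}. \tag{53} \]

If the line search is exact, then $p_k^T g_{k+1} = 0$ and hence $p_k^T q_k = -p_k^T g_k$. In this case
(53) is equivalent to

\[ d_{k+1} = -g_{k+1} + \frac{q_k^T g_{k+1}}{p_k^T q_k}\, p_k = -g_{k+1} + \beta_k d_k, \tag{54} \]

where

\[ \beta_k = \frac{q_k^T g_{k+1}}{g_k^T g_k}. \]

This coincides exactly with the Polak-Ribiere form of the conjugate gradient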
method. Thus use of the BFGS update in this way yields an algorithm that is of 
the modified Newton type with positive definite coefficient matrix and which is 
equivalent to a standard implementation of the conjugate gradient method when the 
line search is exact. 

The algorithm can be used without exact line search in a form that is similar 
to that of the conjugate gradient method by using (53). This requires storage of 
only the same vectors that are required of the conjugate gradient method. In light 
of the theory of quasi-Newton methods, however, the new form can be expected 
to be superior when inexact line searches are employed, and indeed experiments 
confirm this. 
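
The equivalence under exact line search can be checked numerically. The sketch below evaluates the memoryless BFGS direction (53) using only vector operations and compares it with the Polak-Ribiere direction after one exact step on a diagonal quadratic. The test matrix, the starting point, and all function names are illustrative assumptions.

```python
# Sketch: memoryless BFGS direction (53) vs. the Polak-Ribiere direction (54).

Q_DIAG = [40.0, 38.0, 36.0, 34.0, 32.0, 30.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grad(x):
    return [q * xi for q, xi in zip(Q_DIAG, x)]

def memoryless_bfgs_direction(g_new, p, q):
    """d = -H g_new, where H is the BFGS update of the identity (no matrix stored)."""
    pq = dot(p, q)
    pg = dot(p, g_new)
    qg = dot(q, g_new)
    coeff_q = pg / pq
    coeff_p = qg / pq - (1.0 + dot(q, q) / pq) * pg / pq
    return [-gi + coeff_q * qi + coeff_p * pi
            for gi, qi, pi in zip(g_new, q, p)]

def polak_ribiere_direction(g_new, g_old, d_old):
    q = [a - b for a, b in zip(g_new, g_old)]
    beta = dot(q, g_new) / dot(g_old, g_old)
    return [-gn + beta * do for gn, do in zip(g_new, d_old)]

# One exact steepest descent step from an arbitrary starting point.
x0 = [10.0, -3.0, 7.0, 1.0, -5.0, 2.0]
g0 = grad(x0)
d0 = [-gi for gi in g0]
Qd0 = [qd * di for qd, di in zip(Q_DIAG, d0)]
alpha = -dot(g0, d0) / dot(d0, Qd0)          # exact minimizer for a quadratic
p = [alpha * di for di in d0]
x1 = [xi + pi for xi, pi in zip(x0, p)]
g1 = grad(x1)
q = [a - b for a, b in zip(g1, g0)]

d_mem = memoryless_bfgs_direction(g1, p, q)
d_pr = polak_ribiere_direction(g1, g0, d0)
gap = max(abs(a - b) for a, b in zip(d_mem, d_pr))
print("largest componentwise difference:", gap)
```

With an inexact step the two directions generally differ, because the terms in (53) that contain $p_k^T g_{k+1}$ no longer vanish.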

The above idea can be easily extended to produce a memoryless quasi-Newton
method corresponding to any member of the Broyden family. The update formula
(52) would simply use the general Broyden update (42) with $H_k$ set equal to $I$.
In the case of exact line search (with $p_k^T g_{k+1} = 0$), the resulting formula for $d_{k+1}$
reduces to

\[ d_{k+1} = -g_{k+1} + (1 - \phi) \frac{q_k^T g_{k+1}}{q_k^T q_k}\, q_k + \phi\, \frac{q_k^T g_{k+1}}{p_k^T q_k}\, p_k. \tag{55} \]

We note that (55) is equivalent to the conjugate gradient direction (54) only for
$\phi = 1$, corresponding to the BFGS update. For this reason the choice $\phi = 1$ is
generally preferred for this type of method.
Scaling and Preconditioning


Since the conjugate gradient method implemented as a memoryless quasi-Newton 
method is a modified Newton method, the fundamental convergence theory based 
on condition number emphasized throughout this part of the book is applicable, as 
are the procedures for improving convergence. It is clear that the function scaling 
procedures discussed in the previous section can be incorporated. 

According to the general theory of modified Newton methods, it is the eigenvalues
of $H_kF(x_k)$ that influence the convergence properties of these algorithms.
From the analysis of the last section, the memoryless BFGS update procedure will,
in the pure quadratic case, yield a matrix $H_kF$ that has a more favorable eigenvalue
ratio than $F$ itself only if the function $f$ is scaled so that unity is contained in the
interval spanned by the eigenvalues of $F$. Experimental evidence verifies that at least
an initial scaling of the function in this way can lead to significant improvement.
Scaling can be introduced at every step as well, and complete self-scaling can be 
effective in some situations. 

It is possible to extend the scaling procedure to a more general preconditioning 
procedure. In this procedure the matrix governing convergence is changed from
$F(x_k)$ to $HF(x_k)$ for some $H$. If $HF(x_k)$ has its eigenvalues all close to unity,
then the memoryless quasi-Newton method can be expected to perform exceedingly 
well, since it possesses simultaneously the advantages of being a conjugate gradient 
method and being a well-conditioned modified Newton method. 

Preconditioning can be conveniently expressed in the basic algorithm by simply
replacing $H_k$ in the BFGS update formula by $H$ instead of $I$ and replacing $I$ by $H$
in Step 1. Thus (52) becomes

\[ H_{k+1} = H - \frac{H q_k p_k^T + p_k q_k^T H}{p_k^T q_k} + \left( 1 + \frac{q_k^T H q_k}{p_k^T q_k} \right) \frac{p_k p_k^T}{p_k^T q_k}, \tag{56} \]


and the explicit conjugate gradient version (53) is also modified accordingly. 

Preconditioning can also be used in conjunction with an (m+ 1)-cycle partial 
conjugate gradient version of the memoryless quasi-Newton method. This is highly 
effective if a simple H can be found (as it sometimes can in problems with structure) 
so that the eigenvalues of $HF(x_k)$ are such that either all but m are equal to unity
or they are in m bunches. For large-scale problems, methods of this type seem to 
be quite promising. 


*10.8 COMBINATION OF STEEPEST DESCENT 
AND NEWTON’S METHOD 


In this section we digress from the study of quasi-Newton methods, and again 
expand our collection of basic principles. We present a combination of steepest 
descent and Newton’s method which includes them both as special cases. The 
resulting combined method can be used to develop algorithms for problems having 
special structure, as illustrated in Chapter 13. This method and its analysis comprises 
a fundamental element of the modern theory of algorithms. 
The method itself is quite simple. Suppose there is a subspace $N$ of $E^n$ on
which the inverse Hessian of the objective function $f$ is known (we shall make
this statement more precise later). Then, in the quadratic case, the minimum of $f$
over any linear variety parallel to $N$ (that is, any translation of $N$) can be found
in a single step. To minimize $f$ over the whole space starting at any point $x_k$, we
could minimize $f$ over the linear variety parallel to $N$ and containing $x_k$ to obtain
$z_k$, and then take a steepest descent step from there. This procedure is illustrated in
Fig. 10.1. Since $z_k$ is the minimum point of $f$ over a linear variety parallel to $N$,
the gradient at $z_k$ will be orthogonal to $N$, and hence the gradient step is orthogonal
to $N$. If $f$ is not quadratic we can, knowing the Hessian of $f$ on $N$, approximate
the minimum point of $f$ over a linear variety parallel to $N$ by one step of Newton’s
method. To implement this scheme, which we have described in a geometric sense, it is
necessary to agree on a method for defining the subspace $N$ and to determine what
information about the inverse Hessian is required so as to implement a Newton step
over $N$. We now turn to these questions.

Often, the most convenient way to describe a subspace, and the one we follow 
in this development, is in terms of a set of vectors that generate it. Thus, if B is 
an $n \times m$ matrix consisting of m column vectors that generate $N$, we may write $N$
as the set of all vectors of the form $Bu$ where $u \in E^m$. For simplicity we always
assume that the columns of B are linearly independent. 

To see what information about the inverse Hessian is required, imagine that
we are at a point $x_k$ and wish to find the approximate minimum point $z_k$ of $f$ with
respect to movement in $N$. Thus, we seek $u_k$ so that

\[ z_k = x_k + Bu_k \]

approximately minimizes $f$. By “approximately minimizes” we mean that $z_k$ should
be the Newton approximation to the minimum over this subspace. We write

\[ f(z_k) \simeq f(x_k) + \nabla f(x_k) B u_k + \tfrac{1}{2} u_k^T B^T F(x_k) B u_k \]

and solve for $u_k$ to obtain the Newton approximation. We find

\[ u_k = -(B^T F(x_k) B)^{-1} B^T \nabla f(x_k)^T, \]

\[ z_k = x_k - B (B^T F(x_k) B)^{-1} B^T \nabla f(x_k)^T. \]

[Fig. 10.1 Combined method]

We see by analogy with the formula for Newton’s method that the expression
$B(B^T F(x_k) B)^{-1} B^T$ can be interpreted as the inverse of $F(x_k)$ restricted to the
subspace $N$.


Example. Suppose

\[ B = \begin{bmatrix} I \\ 0 \end{bmatrix}, \]

where $I$ is an $m \times m$ identity matrix. This corresponds to the case where $N$ is
the subspace generated by the first m unit basis elements of $E^n$. Let us partition
$F = \nabla^2 f(x_k)$ as

\[ F = \begin{bmatrix} F_{11} & F_{12} \\ F_{21} & F_{22} \end{bmatrix}, \]

where $F_{11}$ is $m \times m$. Then, in this case

\[ (B^T F B)^{-1} = F_{11}^{-1}, \]

and

\[ B (B^T F B)^{-1} B^T = \begin{bmatrix} F_{11}^{-1} & 0 \\ 0 & 0 \end{bmatrix}, \]

which shows explicitly that it is the inverse of $F$ on $N$ that is required. The general
case can be regarded as being obtained through partitioning in some skew coordinate
system.


Now that the Newton approximation over $N$ has been derived, it is possible to
formalize the details of the algorithm suggested by Fig. 10.1. At a given point $x_k$,
the point $x_{k+1}$ is determined through

a) Set $d_k = -B(B^T F(x_k) B)^{-1} B^T \nabla f(x_k)^T$.
b) $z_k = x_k + \beta_k d_k$, where $\beta_k$ minimizes $f(x_k + \beta d_k)$.        (57)
c) Set $p_k = -\nabla f(z_k)^T$.
d) $x_{k+1} = z_k + \alpha_k p_k$, where $\alpha_k$ minimizes $f(z_k + \alpha p_k)$.


The scalar search parameter $\beta_k$ is introduced in the Newton part of the algorithm
simply to assure that the descent conditions required for global convergence are
met. Normally $\beta_k$ will be approximately equal to unity. (See Section 8.8.)
Analysis of Quadratic Case


Since the method is not a full Newton method, we can conclude that it possesses only 
linear convergence and that the dominating aspects of convergence will be revealed 
by an analysis of the method as applied to a quadratic function. Furthermore, as 
might be intuitively anticipated, the associated rate of convergence is governed 
by the steepest descent part of algorithm (57), and that rate is governed by a 
Kantorovich-like ratio defined over the subspace orthogonal to N. 


Theorem. (Combined method). Let $Q$ be an $n \times n$ symmetric positive definite
matrix, and let $x^* \in E^n$. Define the function

\[ E(x) = \tfrac{1}{2} (x - x^*)^T Q (x - x^*) \]

and let $b = Qx^*$. Let $B$ be an $n \times m$ matrix of rank m. Starting at an arbitrary
point $x_0$, define the iterative process

a) $u_k = -(B^T Q B)^{-1} B^T g_k$, where $g_k = Qx_k - b$.
b) $z_k = x_k + Bu_k$.
c) $p_k = b - Qz_k$.
d) $x_{k+1} = z_k + \alpha_k p_k$, where $\alpha_k = \dfrac{p_k^T p_k}{p_k^T Q p_k}$.

This process converges to $x^*$, and satisfies

\[ E(x_{k+1}) \le (1 - \delta) E(x_k), \tag{58} \]

where $\delta$, $0 \le \delta \le 1$, is the minimum of

\[ \frac{(p^T p)^2}{(p^T Q p)(p^T Q^{-1} p)} \]

over all vectors $p$ in the nullspace of $B^T$.


Proof. The algorithm given in the theorem statement is exactly the general
combined algorithm specialized to the quadratic situation. Next we note that

\[ \begin{aligned}
B^T p_k &= B^T Q (x^* - z_k) = B^T Q (x^* - x_k) - B^T Q B u_k \\
        &= -B^T g_k + B^T Q B (B^T Q B)^{-1} B^T g_k = 0,
\end{aligned} \tag{59} \]

which merely proves that the gradient at $z_k$ is orthogonal to $N$. Next we calculate

\[ \begin{aligned}
2\{E(x_k) - E(z_k)\} &= (x_k - x^*)^T Q (x_k - x^*) - (z_k - x^*)^T Q (z_k - x^*) \\
  &= -2 u_k^T B^T Q (x_k - x^*) - u_k^T B^T Q B u_k \\
  &= -2 u_k^T B^T g_k + u_k^T B^T Q B (B^T Q B)^{-1} B^T g_k \\
  &= -u_k^T B^T g_k = g_k^T B (B^T Q B)^{-1} B^T g_k.
\end{aligned} \tag{60} \]
Then we compute

\[ \begin{aligned}
2\{E(z_k) - E(x_{k+1})\} &= (z_k - x^*)^T Q (z_k - x^*) - (x_{k+1} - x^*)^T Q (x_{k+1} - x^*) \\
  &= -2 \alpha_k p_k^T Q (z_k - x^*) - \alpha_k^2 p_k^T Q p_k \\
  &= 2 \alpha_k p_k^T p_k - \alpha_k^2 p_k^T Q p_k \\
  &= \alpha_k p_k^T p_k = \frac{(p_k^T p_k)^2}{p_k^T Q p_k}.
\end{aligned} \tag{61} \]

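Now using (59) and $p_k = -g_k - QBu_k$ we have

\[ \begin{aligned}
2E(x_k) &= (x_k - x^*)^T Q (x_k - x^*) = g_k^T Q^{-1} g_k \\
  &= (p_k^T + u_k^T B^T Q) Q^{-1} (p_k + Q B u_k) \\
  &= p_k^T Q^{-1} p_k + u_k^T B^T Q B u_k \\
  &= p_k^T Q^{-1} p_k + g_k^T B (B^T Q B)^{-1} B^T g_k.
\end{aligned} \tag{62} \]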


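Adding (60) and (61) and dividing by (62) there results

\[ \frac{E(x_k) - E(x_{k+1})}{E(x_k)}
   = \frac{g_k^T B (B^T Q B)^{-1} B^T g_k + (p_k^T p_k)^2 / p_k^T Q p_k}{p_k^T Q^{-1} p_k + g_k^T B (B^T Q B)^{-1} B^T g_k}
   = \frac{q + (p_k^T p_k)/(p_k^T Q p_k)}{q + (p_k^T Q^{-1} p_k)/(p_k^T p_k)}, \]

where $q \ge 0$. This has the form $(q + a)/(q + b)$ with

\[ a = \frac{p_k^T p_k}{p_k^T Q p_k}, \qquad b = \frac{p_k^T Q^{-1} p_k}{p_k^T p_k}. \]

But for any $p_k$, it follows that $a \le b$. Hence

\[ \frac{q + a}{q + b} \ge \frac{a}{b}, \]

and thus

\[ \frac{E(x_k) - E(x_{k+1})}{E(x_k)} \ge \frac{(p_k^T p_k)^2}{(p_k^T Q p_k)(p_k^T Q^{-1} p_k)}. \]

Finally,

\[ E(x_{k+1}) \le E(x_k) \left\{ 1 - \frac{(p_k^T p_k)^2}{(p_k^T Q p_k)(p_k^T Q^{-1} p_k)} \right\} \le (1 - \delta) E(x_k), \]

since $B^T p_k = 0$. ∎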
The value $\delta$ associated with the above theorem is related to the eigenvalue
structure of $Q$. If $p$ were allowed to vary over the whole space, then the Kantorovich
inequality

\[ \frac{(p^T p)^2}{(p^T Q p)(p^T Q^{-1} p)} \ge \frac{4aA}{(a + A)^2}, \]

where $a$ and $A$ are, respectively, the smallest and largest eigenvalues of $Q$, gives
explicitly

\[ \delta = \frac{4aA}{(a + A)^2}. \tag{63} \]


When $p$ is restricted to the nullspace of $B^T$, the corresponding value of $\delta$ is larger.
In some special cases it is possible to obtain a fairly explicit estimate of $\delta$. Suppose,
for example, that the subspace $N$ were the subspace spanned by m eigenvectors of
$Q$. Then the subspace in which $p$ is allowed to vary is the space orthogonal to $N$
and is thus, in this case, the space generated by the other $n - m$ eigenvectors of $Q$.
In this case since for $p$ in $N^{\perp}$ (the space orthogonal to $N$), both $Qp$ and $Q^{-1}p$ are
also in $N^{\perp}$, the ratio $\delta$ satisfies

\[ \delta = \min_{p \in N^{\perp}} \frac{(p^T p)^2}{(p^T Q p)(p^T Q^{-1} p)} \ge \frac{4aA}{(a + A)^2}, \]

where now $a$ and $A$ are, respectively, the smallest and largest of the $n - m$
eigenvalues of $Q$ corresponding to $N^{\perp}$. Thus the convergence ratio (58) reduces to the
familiar form

\[ E(x_{k+1}) \le \left( \frac{A - a}{A + a} \right)^2 E(x_k), \]

where $a$ and $A$ are these special eigenvalues. Thus, if $B$, or equivalently $N$, is chosen
to include the eigenvectors corresponding to the most undesirable eigenvalues of
$Q$, the convergence rate of the combined method will be quite attractive.
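
The bound can be observed numerically. The following sketch runs the iterative process of the theorem for a diagonal $Q$ with $x^* = 0$ (so $b = 0$) and a single-column $B$ equal to the first unit vector, which is the eigenvector of the largest eigenvalue. The choice of $Q$, of the starting point, and all names are assumptions made for illustration; with this $B$ the relevant eigenvalues are $a = 30$ and $A = 38$.

```python
# Sketch: combined Newton / steepest descent method on a diagonal quadratic,
# with B a single column (so B^T Q B is a scalar). Names are illustrative.

Q_DIAG = [40.0, 38.0, 36.0, 34.0, 32.0, 30.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def energy(x):
    """E(x) = 0.5 (x - x*)^T Q (x - x*) with x* = 0."""
    return 0.5 * sum(q * xi * xi for q, xi in zip(Q_DIAG, x))

def combined_step(x, b_col):
    g = [q * xi for q, xi in zip(Q_DIAG, x)]            # g_k = Q x_k - b
    Qb = [q * bi for q, bi in zip(Q_DIAG, b_col)]
    u = -dot(b_col, g) / dot(b_col, Qb)                 # u_k = -(B^T Q B)^{-1} B^T g_k
    z = [xi + u * bi for xi, bi in zip(x, b_col)]       # z_k = x_k + B u_k
    p = [-q * zi for q, zi in zip(Q_DIAG, z)]           # p_k = b - Q z_k
    Qp = [q * pi for q, pi in zip(Q_DIAG, p)]
    alpha = dot(p, p) / dot(p, Qp)                      # exact steepest descent step
    x_next = [zi + alpha * pi for zi, pi in zip(z, p)]
    return x_next, p

b_col = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
a, A = 30.0, 38.0                                       # eigenvalues on the complement of N
delta = 4.0 * a * A / (a + A) ** 2

x = [10.0, -3.0, 7.0, 1.0, -5.0, 2.0]
history = [energy(x)]
residuals = []
for _ in range(4):
    x, p = combined_step(x, b_col)
    history.append(energy(x))
    residuals.append(abs(dot(b_col, p)))                # B^T p_k, zero by (59)

for e_old, e_new in zip(history, history[1:]):
    print(e_new / e_old, "<=", 1.0 - delta)
```

Each printed ratio should fall at or below $1 - \delta = ((A - a)/(A + a))^2$, which is noticeably smaller than the value $(10/70)^2$ that governs ordinary steepest descent on the full matrix.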


Applications 


The combination of steepest descent and Newton’s method can be applied usefully 
in a number of important situations. Suppose, for example, we are faced with a 
problem of the form 


minimize $f(x, y)$,

where $x \in E^n$, $y \in E^m$, and where the second partial derivatives with respect to $x$
are easily computable but those with respect to $y$ are not. We may then employ
Newton steps with respect to $x$ and steepest descent with respect to $y$.
Another instance where this idea can be greatly effective is when there are a
few vital variables in a problem which, being assigned high costs, tend to dominate 
the value of the objective function; in other words, the partial second derivatives 
with respect to these variables are large. The poor conditioning induced by these 
variables can to some extent be reduced by proper scaling of variables, but more 
effectively, by carrying out Newton’s method with respect to them and steepest 
descent with respect to the others. 


10.9 SUMMARY 


The basic motivation behind quasi-Newton methods is to try to obtain, at least on the 
average, the rapid convergence associated with Newton’s method without explicitly 
evaluating the Hessian at every step. This can be accomplished by constructing 
approximations to the inverse Hessian based on information gathered during the 
descent process, and results in methods which viewed in blocks of n steps (where 
n is the dimension of the problem) generally possess superlinear convergence. 

Good, or even superlinear, convergence measured in terms of large blocks, 
however, is not always indicative of rapid convergence measured in terms of 
individual steps. It is important, therefore, to design quasi-Newton methods so 
that their single step convergence is rapid and relatively insensitive to line search 
inaccuracies. We discussed two general principles for examining these aspects of 
descent algorithms. The first of these is the modified Newton method in which 
the direction of descent is taken as the result of multiplication of the negative 
gradient by a positive definite matrix S. The single step convergence ratio of this 
method is determined by the usual steepest descent formula, but with the condition 
number of SF rather than just F used. This result was used to analyze some popular 
quasi-Newton methods, to develop the self-scaling method having good single step 
convergence properties, and to reexamine conjugate gradient methods. 

The second principle is the combined method, in which Newton’s
method is executed over a subspace where the Hessian is known and steepest 
descent is executed elsewhere. This method converges at least as fast as steepest 
descent, and by incorporating the information gathered as the method progresses, 
the Newton portion can be executed over larger and larger subspaces. 

At this point, it is perhaps valuable to summarize some of the main themes 
that have been developed throughout the four chapters comprising Part II. These 
chapters contain several important and popular algorithms that illustrate the range 
of possibilities available for minimizing a general nonlinear function. From a broad 
perspective, however, these individual algorithms can be considered simply as 
specific patterns on the analytical fabric that is woven through the chapters—the 
fabric that will support new algorithms and future developments. 

One unifying element, which has proved its value several times, is the Global
Convergence Theorem. This result helped mold the final form of every algorithm 
presented in Part II and has effectively resolved the major questions concerning 
global convergence. 
Another unifying element is the speed of convergence of an algorithm, which
we have defined in terms of the asymptotic properties of the sequences an algorithm 
generates. Initially, it might have been argued that such measures, based on 
properties of the tail of the sequence, are perhaps not truly indicative of the actual 
time required to solve a problem—after all, a sequence generated in practice is 
a truncated version of the potentially infinite sequence, and asymptotic properties 
may not be representative of the finite version—a more complex measure of the 
speed of convergence may be required. It is fair to demand that the validity of 
the asymptotic measures we have proposed be judged in terms of how well they 
predict the performance of algorithms applied to specific examples. On this basis, 
as illustrated by the numerical examples presented in these chapters, and on others, 
the asymptotic rates are extremely reliable predictors of performance—provided 
that one carefully tempers one’s analysis with common sense (by, for example, not 
concluding that superlinear convergence is necessarily superior to linear convergence
when the superlinear convergence is based on repeated cycles of length n).
A major conclusion, therefore, of the previous chapters is the essential validity of 
the asymptotic approach to convergence analysis. This conclusion is a major strand 
in the analytical fabric of nonlinear programming. 


10.10 EXERCISES 


1. Prove (4) directly for the modified Newton method by showing that each step of the 
modified Newton method is simply the ordinary method of steepest descent applied to 
a scaled version of the original problem. 


2. Find the rate of convergence of the version of Newton’s method defined by (51), (52) 
of Chapter 8. Show that convergence is only linear if $\delta$ is larger than the smallest 
eigenvalue of F(x*). 


3. Consider the problem of minimizing a quadratic function 

$$f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} - \mathbf{x}^T\mathbf{b},$$

where $\mathbf{Q}$ is symmetric and sparse (that is, there are relatively few nonzero entries in $\mathbf{Q}$). 
The matrix $\mathbf{Q}$ has the form 

$$\mathbf{Q} = \mathbf{I} + \mathbf{V},$$

where $\mathbf{I}$ is the identity and $\mathbf{V}$ is a matrix with eigenvalues bounded by $e < 1$ in magnitude. 

a) With the given information, what is the best bound you can give for the rate of 
convergence of steepest descent applied to this problem? 

b) In general it is difficult to invert $\mathbf{Q}$ but the inverse can be approximated by $\mathbf{I} - \mathbf{V}$, 
which is easy to calculate. (The approximation is very good for small $e$.) We are 
thus led to consider the iterative process 

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k[\mathbf{I} - \mathbf{V}]\mathbf{g}_k,$$

where $\mathbf{g}_k = \mathbf{Q}\mathbf{x}_k - \mathbf{b}$ and $\alpha_k$ is chosen to minimize $f$ in the usual way. With the 
information given, what is the best bound on the rate of convergence of this method? 

c) Show that for $e < (\sqrt{5} - 1)/2$ the method in part (b) is always superior to steepest 
descent. 


4. This problem shows that the modified Newton’s method is globally convergent under 
very weak assumptions. 

Let $a > 0$ and $b > a$ be given constants. Consider the collection $\mathcal{P}$ of all $n \times n$ 
symmetric positive definite matrices $\mathbf{P}$ having all eigenvalues greater than or equal to 
$a$ and all elements bounded in absolute value by $b$. Define the point-to-set mapping 
$\mathbf{B}: E^n \to E^{n+n^2}$ by $\mathbf{B}(\mathbf{x}) = \{(\mathbf{x}, \mathbf{P}) : \mathbf{P} \in \mathcal{P}\}$. Show that $\mathbf{B}$ is a closed mapping. 

Now given an objective function $f \in C^1$, consider the iterative algorithm 

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k\mathbf{P}_k\mathbf{g}_k,$$

where $\mathbf{g}_k = \mathbf{g}(\mathbf{x}_k)$ is the gradient of $f$ at $\mathbf{x}_k$, $\mathbf{P}_k$ is any matrix from $\mathcal{P}$ and $\alpha_k$ is chosen 
to minimize $f(\mathbf{x}_{k+1})$. This algorithm can be represented by $\mathbf{A}$ which can be decomposed 
as $\mathbf{A} = \mathbf{S}\mathbf{C}\mathbf{B}$ where $\mathbf{B}$ is defined above, $\mathbf{C}$ is defined by $\mathbf{C}(\mathbf{x}, \mathbf{P}) = (\mathbf{x}, -\mathbf{P}\mathbf{g}(\mathbf{x}))$, and $\mathbf{S}$ 
is the standard line search mapping. Show that if restricted to a compact set in $E^n$, the 
mapping $\mathbf{A}$ is closed. 

Assuming that a sequence $\{\mathbf{x}_k\}$ generated by this algorithm is bounded, show that 
the limit $\mathbf{x}^*$ of any convergent subsequence satisfies $\mathbf{g}(\mathbf{x}^*) = \mathbf{0}$. 


5. The following algorithm has been proposed for minimizing unconstrained functions 
$f(\mathbf{x})$, $\mathbf{x} \in E^n$, without using gradients: Starting with some arbitrary point $\mathbf{x}_0$, obtain a 
direction of search $\mathbf{d}_k$ such that for each component of $\mathbf{d}_k$, 

$$f(\mathbf{x}_k + (\mathbf{d}_k)_i\mathbf{e}_i) = \min_{d_i} f(\mathbf{x}_k + d_i\mathbf{e}_i),$$

where $\mathbf{e}_i$ denotes the $i$th column of the identity matrix. In other words, the $i$th component 
of $\mathbf{d}_k$ is determined through a line search minimizing $f(\mathbf{x})$ along the $i$th component. 

The next point $\mathbf{x}_{k+1}$ is then determined in the usual way through a line search along 
$\mathbf{d}_k$; that is, 

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{d}_k,$$

where $\alpha_k$ minimizes $f(\mathbf{x}_{k+1})$. 

a) Obtain an explicit representation for the algorithm for the quadratic case where 
$f(\mathbf{x}) = \tfrac{1}{2}(\mathbf{x} - \mathbf{x}^*)^T\mathbf{Q}(\mathbf{x} - \mathbf{x}^*) + f(\mathbf{x}^*)$. 

b) What condition on $f(\mathbf{x})$ or its derivatives will guarantee descent of this algorithm 
for general $f(\mathbf{x})$? 

c) Derive the convergence rate of this algorithm (assuming a quadratic objective). 
Express your answer in terms of the condition number of some matrix. 


6. Suppose that the rank one correction method of Section 10.2 is applied to the quadratic 
problem (2) and suppose that the matrix $\mathbf{R}_0 = \mathbf{F}^{1/2}\mathbf{H}_0\mathbf{F}^{1/2}$ has $m < n$ eigenvalues less than 
unity and $n - m$ eigenvalues greater than unity. Show that the condition $\mathbf{q}_k^T(\mathbf{p}_k - \mathbf{H}_k\mathbf{q}_k) > 0$ 
will be satisfied at most $m$ times during the course of the method and hence, if updating 
is performed only when this condition holds, the sequence $\{\mathbf{H}_k\}$ will not converge to 
$\mathbf{F}^{-1}$. Infer from this that, in using the rank one correction method, $\mathbf{H}_0$ should be taken 
very small; but that, despite such a precaution, on nonquadratic problems the method is 
subject to difficulty. 


7. Show that if $\mathbf{H}_0 = \mathbf{I}$ the Davidon-Fletcher-Powell method is the conjugate gradient 
method. What similar statement can be made when $\mathbf{H}_0$ is an arbitrary symmetric positive 
definite matrix? 


8. In the text it is shown that for the Davidon-Fletcher-Powell method $\mathbf{H}_{k+1}$ is positive 
definite if $\mathbf{H}_k$ is. The proof assumed that $\alpha_k$ is chosen to exactly minimize $f(\mathbf{x}_k + \alpha\mathbf{d}_k)$. 
Show that any $\alpha_k > 0$ which leads to $\mathbf{p}_k^T\mathbf{q}_k > 0$ will guarantee the positive definiteness 
of $\mathbf{H}_{k+1}$. Show that for a quadratic problem any $\alpha_k \neq 0$ leads to a positive definite $\mathbf{H}_{k+1}$. 


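9. Suppose along the line $\mathbf{x}_k + \alpha\mathbf{d}_k$, $\alpha > 0$, the function $f(\mathbf{x}_k + \alpha\mathbf{d}_k)$ is unimodal and 
differentiable. Let $\bar{\alpha}_k$ be the minimizing value of $\alpha$. Show that if any $\alpha_k > \bar{\alpha}_k$ is selected 
to define $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{d}_k$, then $\mathbf{p}_k^T\mathbf{q}_k > 0$. (Refer to Section 10.3.) 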


10. Let $\{\mathbf{H}_k\}$, $k = 0, 1, 2, \ldots$ be the sequence of matrices generated by the Davidon- 
Fletcher-Powell method applied, without restarting, to a function $f$ having continuous 
second partial derivatives. Assuming that there is $a > 0$, $A > 0$ such that for all $k$ we 
have $\mathbf{H}_k - a\mathbf{I}$ and $A\mathbf{I} - \mathbf{H}_k$ positive definite and the corresponding sequence of $\mathbf{x}_k$’s is 
bounded, show that the method is globally convergent. 


11. Verify Eq. (42). 


12. a) Show that starting with the rank one update formula for $\mathbf{H}$, forming the 
complementary formula, and then taking the inverse restores the original formula. 
b) What value of $\phi$ in the Broyden class corresponds to the rank one formula? 


13. Explain how the partial Davidon method can be implemented for $m < n/2$, with less 
storage than required by the full method. 


14. Prove statements (1) and (2) below Eq. (47) in Section 10.6. 


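15. Consider using 

$$\gamma_k = \frac{\mathbf{p}_k^T\mathbf{H}_k^{-1}\mathbf{p}_k}{\mathbf{p}_k^T\mathbf{q}_k}$$

instead of (48). 

a) Show that this also serves as a suitable scale factor for a self-scaling quasi-Newton 
method. 
b) Extend part (a) to 

$$\gamma_k = (1 - \phi)\frac{\mathbf{p}_k^T\mathbf{q}_k}{\mathbf{q}_k^T\mathbf{H}_k\mathbf{q}_k} + \phi\frac{\mathbf{p}_k^T\mathbf{H}_k^{-1}\mathbf{p}_k}{\mathbf{p}_k^T\mathbf{q}_k}$$

for $0 \leq \phi \leq 1$. 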


16. Prove global convergence of the combination of steepest descent and Newton’s method. 


17. Formulate a rate of convergence theorem for the application of the combination of 
steepest descent and Newton’s method to nonquadratic problems. 


18. Prove that if $\mathbf{Q}$ is positive definite 

$$\frac{\mathbf{p}^T\mathbf{p}}{\mathbf{p}^T\mathbf{Q}\mathbf{p}} \leq \frac{\mathbf{p}^T\mathbf{Q}^{-1}\mathbf{p}}{\mathbf{p}^T\mathbf{p}}$$

for any vector $\mathbf{p}$. 


19. It is possible to combine Newton’s method and the partial conjugate gradient method. 
Given a subspace $N \subset E^n$, $\mathbf{x}_{k+1}$ is generated from $\mathbf{x}_k$ by first finding $\mathbf{z}_k$ by taking a 
Newton step in the linear variety through $\mathbf{x}_k$ parallel to $N$, and then taking $m$ conjugate 
gradient steps from $\mathbf{z}_k$. What is a bound on the rate of convergence of this method? 


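20. In this exercise we explore how the combined method of Section 10.7 can be updated 
as more information becomes available. Begin with $N_0 = \{\mathbf{0}\}$. If $N_k$ is represented by 
the corresponding matrix $\mathbf{B}_k$, define $N_{k+1}$ by the corresponding $\mathbf{B}_{k+1} = [\mathbf{B}_k, \mathbf{p}_k]$, where 
$\mathbf{p}_k = \mathbf{x}_{k+1} - \mathbf{z}_k$. 

a) If $\mathbf{D}_k = \mathbf{B}_k(\mathbf{B}_k^T\mathbf{F}\mathbf{B}_k)^{-1}\mathbf{B}_k^T$ is known, show that 

$$\mathbf{D}_{k+1} = \mathbf{D}_k + \frac{(\mathbf{p}_k - \mathbf{D}_k\mathbf{q}_k)(\mathbf{p}_k - \mathbf{D}_k\mathbf{q}_k)^T}{(\mathbf{p}_k - \mathbf{D}_k\mathbf{q}_k)^T\mathbf{q}_k},$$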
where q, = 8,,, — 2,- (This is the rank one correction of Section 10.2.) 

b) Develop an algorithm that uses (a) in conjunction with the combined method of 
Section 10.8 and discuss its convergence properties. 


Chapter 11 CONSTRAINED MINIMIZATION CONDITIONS 


We turn now, in this final part of the book, to the study of minimization problems 
having constraints. We begin by studying in this chapter the necessary and sufficient 
conditions satisfied at solution points. These conditions, aside from their intrinsic 
value in characterizing solutions, define Lagrange multipliers and a certain Hessian 
matrix which, taken together, form the foundation for both the development and 
analysis of algorithms presented in subsequent chapters. 

The general method used in this chapter to derive necessary and sufficient 
conditions is a straightforward extension of that used in Chapter 7 for unconstrained 
problems. In the case of equality constraints, the feasible region is a curved surface 
embedded in $E^n$. Differential conditions satisfied at an optimal point are derived by 
considering the value of the objective function along curves on this surface passing 
through the optimal point. Thus the arguments run almost identically to those for 
the unconstrained case; families of curves on the constraint surface replacing the 
earlier artifice of considering feasible directions. There is also a theory of zero-order 
conditions that is presented in the final section of the chapter. 


11.1 CONSTRAINTS 


We deal with general nonlinear programming problems of the form 


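$$
\begin{array}{lll}
\text{minimize} & f(\mathbf{x}) & \\
\text{subject to} & h_1(\mathbf{x}) = 0 & g_1(\mathbf{x}) \leq 0 \\
 & h_2(\mathbf{x}) = 0 & g_2(\mathbf{x}) \leq 0 \\
 & \quad\vdots & \quad\vdots \\
 & h_m(\mathbf{x}) = 0 & g_p(\mathbf{x}) \leq 0 \\
 & \mathbf{x} \in \Omega \subset E^n, &
\end{array}
\tag{1}
$$

where $m \leq n$ and the functions $f$, $h_i$, $i = 1, 2, \ldots, m$ and $g_j$, $j = 1, 2, \ldots, p$ 
are continuous, and usually assumed to possess continuous second partial 
derivatives. For notational simplicity, we introduce the vector-valued functions 
$\mathbf{h} = (h_1, h_2, \ldots, h_m)$ and $\mathbf{g} = (g_1, g_2, \ldots, g_p)$ and rewrite (1) as 

$$
\begin{array}{ll}
\text{minimize} & f(\mathbf{x}) \\
\text{subject to} & \mathbf{h}(\mathbf{x}) = \mathbf{0}, \quad \mathbf{g}(\mathbf{x}) \leq \mathbf{0} \\
 & \mathbf{x} \in \Omega.
\end{array}
\tag{2}
$$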


The constraints $\mathbf{h}(\mathbf{x}) = \mathbf{0}$, $\mathbf{g}(\mathbf{x}) \leq \mathbf{0}$ are referred to as functional constraints, 
while the constraint $\mathbf{x} \in \Omega$ is a set constraint. As before we continue to de-emphasize 
the set constraint, assuming in most cases that either $\Omega$ is the whole space $E^n$ or 
that the solution to (2) is in the interior of $\Omega$. A point $\mathbf{x} \in \Omega$ that satisfies all the 
functional constraints is said to be feasible. 

A fundamental concept that provides a great deal of insight as well as simpli- 
fying the required theoretical development is that of an active constraint. An 
inequality constraint $g_i(\mathbf{x}) \leq 0$ is said to be active at a feasible point $\mathbf{x}$ if $g_i(\mathbf{x}) = 0$ 
and inactive at $\mathbf{x}$ if $g_i(\mathbf{x}) < 0$. By convention we refer to any equality constraint 
$h_i(\mathbf{x}) = 0$ as active at any feasible point. The constraints active at a feasible point 
$\mathbf{x}$ restrict the domain of feasibility in neighborhoods of $\mathbf{x}$, while the other, inactive 
constraints, have no influence in neighborhoods of $\mathbf{x}$. Therefore, in studying the 
properties of a local minimum point, it is clear that attention can be restricted to the 
active constraints. This is illustrated in Fig. 11.1 where local properties satisfied by 
the solution $\mathbf{x}^*$ obviously do not depend on the inactive constraints $g_2$ and $g_3$. 
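
As a small illustration of the definition of active and inactive constraints, the following Python sketch sorts the inequality constraints at a feasible point into the two groups. The function name `classify_inequalities`, the tolerance `tol`, and the three sample constraints are assumptions made only for this illustration; they do not come from the text.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def classify_inequalities(
    g: Sequence[Callable[[Vector], float]], x: Vector, tol: float = 1e-9
) -> Tuple[List[int], List[int]]:
    """Split constraints g_j(x) <= 0 into (active, inactive) index lists.

    A constraint counts as active when |g_j(x)| <= tol. A ValueError is
    raised if x violates some constraint by more than tol.
    """
    active, inactive = [], []
    for j, g_j in enumerate(g):
        value = g_j(x)
        if value > tol:
            raise ValueError(f"constraint {j} is violated: g_j(x) = {value}")
        if abs(value) <= tol:
            active.append(j)
        else:
            inactive.append(j)
    return active, inactive


# Hypothetical constraints, chosen only for illustration.
constraints = [
    lambda x: x[0] ** 2 + x[1] ** 2 - 1.0,  # inside the unit disk
    lambda x: -x[0] - 2.0,                  # x1 >= -2
    lambda x: x[1] - 3.0,                   # x2 <= 3
]

active, inactive = classify_inequalities(constraints, (1.0, 0.0))
```

At the point (1, 0) only the first constraint holds with equality, so only it can influence the local properties of a minimum there.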

It is clear that, if it were known a priori which constraints were active at the 
solution to (1), the solution would be a local minimum point of the problem defined 
by ignoring the inactive constraints and treating all active constraints as equality 
constraints. Hence, with respect to local (or relative) solutions, the problem could 
be regarded as having equality constraints only. This observation suggests that the 
majority of insight and theory applicable to (1) can be derived by consideration of 
equality constraints alone, later making additions to account for the selection of the 
active constraints. This is indeed so. Therefore, in the early portion of this chapter 
we consider problems having only equality constraints, thereby both economizing 
on notation and isolating the primary ideas associated with constrained problems. 
We then extend these results to the more general situation. 

Fig. 11.1 Example of inactive constraints 


11.2 TANGENT PLANE 


A set of equality constraints on $E^n$ 

$$
\begin{array}{c}
h_1(\mathbf{x}) = 0 \\
h_2(\mathbf{x}) = 0 \\
\vdots \\
h_m(\mathbf{x}) = 0
\end{array}
\tag{3}
$$

defines a subset of $E^n$ which is best viewed as a hypersurface. If the constraints 
are everywhere regular, in a sense to be described below, this hypersurface is of 
dimension $n - m$. If, as we assume in this section, the functions $h_i$, $i = 1, 2, \ldots, m$ 
belong to $C^1$, the surface defined by them is said to be smooth. 

Associated with a point on a smooth surface is the tangent plane at that point, 
a term which in two or three dimensions has an obvious meaning. To formalize the 
general notion, we begin by defining curves on a surface. A curve on a surface S 
is a family of points $\mathbf{x}(t) \in S$ continuously parameterized by $t$ for $a \leq t \leq b$. The 
curve is differentiable if $\dot{\mathbf{x}} \equiv (d/dt)\mathbf{x}(t)$ exists, and is twice differentiable if $\ddot{\mathbf{x}}(t)$ 
exists. A curve $\mathbf{x}(t)$ is said to pass through the point $\mathbf{x}^*$ if $\mathbf{x}^* = \mathbf{x}(t^*)$ for some 
$t^*$, $a \leq t^* \leq b$. The derivative of the curve at $\mathbf{x}^*$ is, of course, defined as $\dot{\mathbf{x}}(t^*)$. It is 
itself a vector in $E^n$. 

Now consider all differentiable curves on S passing through a point x*. The 
tangent plane at x* is defined as the collection of the derivatives at x* of all these 
differentiable curves. The tangent plane is a subspace of $E^n$. 

For surfaces defined through a set of constraint relations such as (3), the 
problem of obtaining an explicit representation for the tangent plane is a fundamental 
problem that we now address. Ideally, we would like to express this tangent plane 
in terms of derivatives of functions $h_i$ that define the surface. We introduce the 
subspace 


$$M = \{\mathbf{y} : \nabla\mathbf{h}(\mathbf{x}^*)\mathbf{y} = \mathbf{0}\}$$


and investigate under what conditions M is equal to the tangent plane at x*. The 
key concept for this purpose is that of a regular point. Figure 11.2 shows some 
examples where for visual clarity the tangent planes (which are sub-spaces) are 
translated to the point x*. 

Fig. 11.2 Examples of tangent planes (translated to x*) 


Definition. A point $\mathbf{x}^*$ satisfying the constraint $\mathbf{h}(\mathbf{x}^*) = \mathbf{0}$ is said 
to be a regular point of the constraint if the gradient vectors 
$\nabla h_1(\mathbf{x}^*), \nabla h_2(\mathbf{x}^*), \ldots, \nabla h_m(\mathbf{x}^*)$ are linearly independent. 


Note that if h is affine, h(x) = Ax+b, regularity is equivalent to A having 
rank equal to m, and this condition is independent of x. 
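
For affine constraints the regularity test therefore reduces to a rank computation. The sketch below is one way to carry it out for small dense matrices; the helper names `matrix_rank` and `affine_constraints_regular`, the pivot tolerance, and the two sample matrices are assumptions introduced here for illustration.

```python
from typing import List


def matrix_rank(rows: List[List[float]], tol: float = 1e-10) -> int:
    """Rank of a small dense matrix by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in rows]
    n_rows, n_cols = len(a), len(a[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Choose the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, n_rows):
            factor = a[r][col] / a[rank][col]
            for c in range(col, n_cols):
                a[r][c] -= factor * a[rank][c]
        rank += 1
    return rank


def affine_constraints_regular(A: List[List[float]]) -> bool:
    """For h(x) = Ax + b with A of size m x n, regularity means rank(A) = m."""
    return matrix_rank(A) == len(A)


# Hypothetical constraint matrices with m = 2, n = 3.
A_regular = [[1, 1, 1], [1, -1, 0]]      # independent rows
A_degenerate = [[1, 1, 1], [2, 2, 2]]    # second row is a multiple of the first

regular_case = affine_constraints_regular(A_regular)
degenerate_case = affine_constraints_regular(A_degenerate)
```

Since the test involves only A, the answer is the same at every feasible point, in agreement with the remark above.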

In general, at regular points it is possible to characterize the tangent plane in 
terms of the gradients of the constraint functions. 


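Theorem. At a regular point $\mathbf{x}^*$ of the surface $S$ defined by $\mathbf{h}(\mathbf{x}) = \mathbf{0}$ the 
tangent plane is equal to 


$$M = \{\mathbf{y} : \nabla\mathbf{h}(\mathbf{x}^*)\mathbf{y} = \mathbf{0}\}.$$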


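Proof. Let $T$ be the tangent plane at $\mathbf{x}^*$. It is clear that $T \subset M$ whether $\mathbf{x}^*$ is 
regular or not, for any curve $\mathbf{x}(t)$ passing through $\mathbf{x}^*$ at $t = t^*$ having derivative 
$\dot{\mathbf{x}}(t^*)$ such that $\nabla\mathbf{h}(\mathbf{x}^*)\dot{\mathbf{x}}(t^*) \neq \mathbf{0}$ would not lie on $S$. 

To prove that $M \subset T$ we must show that if $\mathbf{y} \in M$ then there is a curve on $S$ 
passing through $\mathbf{x}^*$ with derivative $\mathbf{y}$. To construct such a curve we consider the 
equations 

$$\mathbf{h}(\mathbf{x}^* + t\mathbf{y} + \nabla\mathbf{h}(\mathbf{x}^*)^T\mathbf{u}(t)) = \mathbf{0}, \tag{4}$$

where for fixed $t$ we consider $\mathbf{u}(t) \in E^m$ to be the unknown. This is a nonlinear 
system of $m$ equations and $m$ unknowns, parameterized continuously by $t$. At $t = 0$ 
there is a solution $\mathbf{u}(0) = \mathbf{0}$. The Jacobian matrix of the system with respect to $\mathbf{u}$ at 
$t = 0$ is the $m \times m$ matrix 

$$\nabla\mathbf{h}(\mathbf{x}^*)\nabla\mathbf{h}(\mathbf{x}^*)^T,$$

which is nonsingular, since $\nabla\mathbf{h}(\mathbf{x}^*)$ is of full rank if $\mathbf{x}^*$ is a regular point. Thus, by the 
Implicit Function Theorem (see Appendix A) there is a continuously differentiable 
solution $\mathbf{u}(t)$ in some region $-a \leq t \leq a$. 

The curve $\mathbf{x}(t) = \mathbf{x}^* + t\mathbf{y} + \nabla\mathbf{h}(\mathbf{x}^*)^T\mathbf{u}(t)$ is thus, by construction, a curve on $S$. 
By differentiating the system (4) with respect to $t$ at $t = 0$ we obtain 

$$\mathbf{0} = \frac{d}{dt}\mathbf{h}(\mathbf{x}(t))\Big|_{t=0} = \nabla\mathbf{h}(\mathbf{x}^*)\mathbf{y} + \nabla\mathbf{h}(\mathbf{x}^*)\nabla\mathbf{h}(\mathbf{x}^*)^T\dot{\mathbf{u}}(0).$$

By definition of $\mathbf{y}$ we have $\nabla\mathbf{h}(\mathbf{x}^*)\mathbf{y} = \mathbf{0}$ and thus, again since $\nabla\mathbf{h}(\mathbf{x}^*)\nabla\mathbf{h}(\mathbf{x}^*)^T$ is 
nonsingular, we conclude that $\dot{\mathbf{u}}(0) = \mathbf{0}$. Therefore 

$$\dot{\mathbf{x}}(0) = \mathbf{y} + \nabla\mathbf{h}(\mathbf{x}^*)^T\dot{\mathbf{u}}(0) = \mathbf{y},$$

and the constructed curve has derivative $\mathbf{y}$ at $\mathbf{x}^*$. ∎ 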


It is important to recognize that the condition of being a regular point is not a 
condition on the constraint surface itself but on its representation in terms of an h. 
The tangent plane is defined independently of the representation, while M is not. 

Example. In $E^2$ let $h(x_1, x_2) = x_1$. Then $h(\mathbf{x}) = 0$ yields the $x_2$ axis, and every 
point on that axis is regular. If instead we put $h(x_1, x_2) = x_1^2$, again $S$ is the $x_2$ 
axis but now no point on the axis is regular. Indeed in this case $M = E^2$, while the 
tangent plane is the $x_2$ axis. 
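
The two representations in this example can be compared numerically. With a single constraint, regularity simply means that the gradient is nonzero, so a finite-difference gradient suffices. In the sketch below the helper names, the step size, and the test point (0, 1.5) on the $x_2$ axis are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def gradient(h: Callable[[Sequence[float]], float], x: Sequence[float],
             step: float = 1e-6) -> List[float]:
    """Central-difference approximation of the gradient of a scalar function h at x."""
    grad = []
    for i in range(len(x)):
        forward, backward = list(x), list(x)
        forward[i] += step
        backward[i] -= step
        grad.append((h(forward) - h(backward)) / (2.0 * step))
    return grad


def is_regular_single_constraint(h, x, tol: float = 1e-8) -> bool:
    """With one constraint, x is regular exactly when the gradient of h is nonzero."""
    return any(abs(component) > tol for component in gradient(h, x))


def h_linear(x):
    return x[0]          # h(x1, x2) = x1


def h_squared(x):
    return x[0] ** 2     # h(x1, x2) = x1^2, same surface, different representation


point = [0.0, 1.5]       # a point on the x2 axis
regular_linear = is_regular_single_constraint(h_linear, point)
regular_squared = is_regular_single_constraint(h_squared, point)
```

The surface is the same in both cases; only the representation changes, and with it the gradient and hence the subspace M.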


11.3 FIRST-ORDER NECESSARY CONDITIONS 
(EQUALITY CONSTRAINTS) 


The derivation of necessary and sufficient conditions for a point to be a local 
minimum point subject to equality constraints is fairly simple now that the represen- 
tation of the tangent plane is known. We begin by deriving the first-order necessary 
conditions. 


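Lemma. Let $\mathbf{x}^*$ be a regular point of the constraints $\mathbf{h}(\mathbf{x}) = \mathbf{0}$ and a local 
extremum point (a minimum or maximum) of $f$ subject to these constraints. 
Then all $\mathbf{y} \in E^n$ satisfying 

$$\nabla\mathbf{h}(\mathbf{x}^*)\mathbf{y} = \mathbf{0} \tag{5}$$

must also satisfy 

$$\nabla f(\mathbf{x}^*)\mathbf{y} = 0. \tag{6}$$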


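Proof. Let $\mathbf{y}$ be any vector in the tangent plane at $\mathbf{x}^*$ and let $\mathbf{x}(t)$ be any smooth 
curve on the constraint surface passing through $\mathbf{x}^*$ with derivative $\mathbf{y}$ at $\mathbf{x}^*$; that is, 
$\mathbf{x}(0) = \mathbf{x}^*$, $\dot{\mathbf{x}}(0) = \mathbf{y}$, and $\mathbf{h}(\mathbf{x}(t)) = \mathbf{0}$ for $-a \leq t \leq a$ for some $a > 0$. 

Since $\mathbf{x}^*$ is a regular point, the tangent plane is identical with the set of $\mathbf{y}$’s 
satisfying $\nabla\mathbf{h}(\mathbf{x}^*)\mathbf{y} = \mathbf{0}$. Then, since $\mathbf{x}^*$ is a constrained local extremum point of $f$, 
we have 

$$\frac{d}{dt}f(\mathbf{x}(t))\Big|_{t=0} = 0,$$

or equivalently, 

$$\nabla f(\mathbf{x}^*)\mathbf{y} = 0. \qquad ∎$$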


The above Lemma says that Vf(x*) is orthogonal to the tangent plane. Next 
we conclude that this implies that Vf(x*) is a linear combination of the gradients 
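of h at x*, a relation that leads to the introduction of Lagrange multipliers. 

Theorem. Let $\mathbf{x}^*$ be a local extremum point of $f$ subject to the constraints 
$\mathbf{h}(\mathbf{x}) = \mathbf{0}$. Assume further that $\mathbf{x}^*$ is a regular point of these constraints. Then 
there is a $\boldsymbol{\lambda} \in E^m$ such that 

$$\nabla f(\mathbf{x}^*) + \boldsymbol{\lambda}^T\nabla\mathbf{h}(\mathbf{x}^*) = \mathbf{0}. \tag{7}$$

Proof. From the Lemma we may conclude that the value of the linear program 

$$
\begin{array}{ll}
\text{maximize} & \nabla f(\mathbf{x}^*)\mathbf{y} \\
\text{subject to} & \nabla\mathbf{h}(\mathbf{x}^*)\mathbf{y} = \mathbf{0}
\end{array}
$$

is zero. Thus, by the Duality Theorem of linear programming (Section 4.2) 
the dual problem is feasible. Specifically, there is $\boldsymbol{\lambda} \in E^m$ such that 
$\nabla f(\mathbf{x}^*) + \boldsymbol{\lambda}^T\nabla\mathbf{h}(\mathbf{x}^*) = \mathbf{0}$. ∎ 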


It should be noted that the first-order necessary conditions 

$$\nabla f(\mathbf{x}^*) + \boldsymbol{\lambda}^T\nabla\mathbf{h}(\mathbf{x}^*) = \mathbf{0}$$

together with the constraints 

$$\mathbf{h}(\mathbf{x}^*) = \mathbf{0}$$

give a total of $n + m$ (generally nonlinear) equations in the $n + m$ variables 
comprising $\mathbf{x}^*$, $\boldsymbol{\lambda}$. Thus the necessary conditions are a complete set since, at least 
locally, they determine a unique solution. 

It is convenient to introduce the Lagrangian associated with the constrained 
problem, defined as 


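$$l(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \boldsymbol{\lambda}^T\mathbf{h}(\mathbf{x}). \tag{8}$$

The necessary conditions can then be expressed in the form 

$$\nabla_{\mathbf{x}}\, l(\mathbf{x}, \boldsymbol{\lambda}) = \mathbf{0} \tag{9}$$
$$\nabla_{\boldsymbol{\lambda}}\, l(\mathbf{x}, \boldsymbol{\lambda}) = \mathbf{0}, \tag{10}$$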

the second of these being simply a restatement of the constraints. 


11.4 EXAMPLES 


We digress briefly from our mathematical development to consider some examples 
of constrained optimization problems. We present five simple examples that can 
be treated explicitly in a short space and then briefly discuss a broader range of 
applications. 


Example 1. Consider the problem 

$$
\begin{array}{ll}
\text{minimize} & x_1x_2 + x_2x_3 + x_1x_3 \\
\text{subject to} & x_1 + x_2 + x_3 = 3.
\end{array}
$$

The necessary conditions become 

$$
\begin{array}{l}
x_2 + x_3 + \lambda = 0 \\
x_1 + x_3 + \lambda = 0 \\
x_1 + x_2 + \lambda = 0.
\end{array}
$$

These three equations together with the one constraint equation give four equations 
that can be solved for the four unknowns $x_1, x_2, x_3, \lambda$. Solution yields 
$x_1 = x_2 = x_3 = 1$, $\lambda = -2$. 
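
Because the objective in Example 1 is quadratic and the constraint is linear, the four equations form a linear system that can be solved directly. The following sketch does so with a small Gaussian elimination routine; `solve_linear` is a helper written for this illustration, not a routine from the text, and the unknowns are ordered as $(x_1, x_2, x_3, \lambda)$ by assumption.

```python
from typing import List, Sequence


def solve_linear(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        partial = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - partial) / m[r][r]
    return x


# Unknowns ordered as (x1, x2, x3, lambda).
kkt_matrix = [
    [0, 1, 1, 1],  # x2 + x3 + lambda = 0
    [1, 0, 1, 1],  # x1 + x3 + lambda = 0
    [1, 1, 0, 1],  # x1 + x2 + lambda = 0
    [1, 1, 1, 0],  # x1 + x2 + x3 = 3
]
rhs = [0, 0, 0, 3]

x1, x2, x3, lam = solve_linear(kkt_matrix, rhs)
```

The computed values agree with the solution stated in the example up to rounding.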


Example 2 (Maximum volume). Let us consider an example of the type that is 
now standard in textbooks and which has a structure similar to that of the example 
above. We seek to construct a cardboard box of maximum volume, given a fixed 
area of cardboard. 

Denoting the dimensions of the box by x, y, z, the problem can be expressed 
as 


$$
\begin{array}{ll}
\text{maximize} & xyz \\
\text{subject to} & (xy + yz + xz) = \dfrac{c}{2},
\end{array}
\tag{11}
$$

where $c > 0$ is the given area of cardboard. Introducing a Lagrange multiplier, the 
first-order necessary conditions are easily found to be 

$$
\begin{array}{l}
yz + \lambda(y + z) = 0 \\
xz + \lambda(x + z) = 0 \\
xy + \lambda(x + y) = 0
\end{array}
\tag{12}
$$

together with the constraint. Before solving these, let us note that the sum of these 
equations is $(xy + yz + xz) + 2\lambda(x + y + z) = 0$. Using the constraint this becomes 
$c/2 + 2\lambda(x + y + z) = 0$. From this it is clear that $\lambda \neq 0$. Now we can show that 
$x$, $y$, and $z$ are nonzero. This follows because $x = 0$ implies $z = 0$ from the second 
equation and $y = 0$ from the third equation. In a similar way, it is seen that if any of 
$x$, $y$, or $z$ is zero, all must be zero, which is impossible. 

To solve the equations, multiply the first by $x$ and the second by $y$, and then 
subtract the two to obtain 

$$\lambda(x - y)z = 0.$$

Operate similarly on the second and third to obtain 

$$\lambda(y - z)x = 0.$$

Since no variables can be zero, it follows that $x = y = z = \sqrt{c/6}$ is the unique 
solution to the necessary conditions. The box must be a cube. 
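
The cube solution is easy to check by substitution. The sketch below evaluates the residuals of the three conditions (12) and of the constraint (11) at $x = y = z = \sqrt{c/6}$. The value $c = 24$ is a hypothetical area chosen for illustration, and the multiplier $\lambda = -x/2$ is the value implied by the first equation of (12) when all sides are equal; it is not stated in the text.

```python
import math


def box_residuals(x: float, y: float, z: float, lam: float, c: float):
    """Residuals of the stationarity equations (12) and the area constraint (11)."""
    return (
        y * z + lam * (y + z),
        x * z + lam * (x + z),
        x * y + lam * (x + y),
        (x * y + y * z + x * z) - c / 2.0,
    )


c = 24.0                     # hypothetical cardboard area
side = math.sqrt(c / 6.0)    # candidate from the text: x = y = z = sqrt(c/6)
lam = -side / 2.0            # from y*z + lam*(y + z) = 0 with equal sides
residuals = box_residuals(side, side, side, lam, c)
volume = side ** 3
```

All four residuals vanish, so the cube with this multiplier satisfies the necessary conditions for the chosen area.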


Example 3 (Entropy). Optimization problems often describe natural phenomena. 
An example is the characterization of naturally occurring probability distributions 
as maximum entropy distributions. 

As a specific example consider a discrete probability density corresponding to 
a measured value taking one of $n$ values $x_1, x_2, \ldots, x_n$. The probability associated 
with $x_i$ is $p_i$. The $p_i$’s satisfy $p_i \geq 0$ and $\sum_{i=1}^n p_i = 1$. 

The entropy of such a density is 

$$\varepsilon = -\sum_{i=1}^n p_i\log(p_i).$$

The mean value of the density is $\sum_{i=1}^n x_ip_i$. 

If the value of mean is known to be $m$ (by the physical situation), the maximum 
entropy argument suggests that the density should be taken as that which solves the 
following problem: 

$$
\begin{array}{ll}
\text{maximize} & -\sum_{i=1}^n p_i\log(p_i) \\
\text{subject to} & \sum_{i=1}^n p_i = 1 \\
 & \sum_{i=1}^n x_ip_i = m \\
 & p_i \geq 0, \quad i = 1, 2, \ldots, n.
\end{array}
\tag{13}
$$

We begin by ignoring the nonnegativity constraints, believing that they may 
be inactive. Introducing two Lagrange multipliers, $\lambda$ and $\mu$, the Lagrangian is 

$$l = \sum_{i=1}^n \{-p_i\log p_i + \lambda p_i + \mu x_ip_i\} - \lambda - \mu m.$$

The necessary conditions are immediately found to be 

$$-\log p_i - 1 + \lambda + \mu x_i = 0, \quad i = 1, 2, \ldots, n.$$

This leads to 

$$p_i = \exp\{(\lambda - 1) + \mu x_i\}, \quad i = 1, 2, \ldots, n. \tag{14}$$

We note that $p_i > 0$, so the nonnegativity constraints are indeed inactive. The result 
(14) is known as an exponential density. The Lagrange multipliers $\lambda$ and $\mu$ are 
parameters that must be selected so that the two equality constraints are satisfied. 
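
One way to select the two multipliers numerically is sketched below. It relies on two observations that are not spelled out in the text: the normalization constraint fixes $\lambda$ once $\mu$ is known, and the mean of the density (14) increases with $\mu$, so $\mu$ can be found by bisection. The function name, the bracketing strategy, and the sample data are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def max_entropy_density(values: Sequence[float], mean: float) -> Tuple[List[float], float, float]:
    """Return (p, lam, mu) with p_i = exp((lam - 1) + mu * x_i),
    sum p_i = 1 and sum x_i p_i = mean."""
    if not (min(values) < mean < max(values)):
        raise ValueError("mean must lie strictly between the smallest and largest value")

    def mean_for(mu: float) -> float:
        # Shift the exponents by their maximum to avoid overflow.
        shift = max(mu * x for x in values)
        weights = [math.exp(mu * x - shift) for x in values]
        return sum(x * w for x, w in zip(values, weights)) / sum(weights)

    # Bracket mu, then bisect; mean_for is increasing in mu.
    lo, hi = -1.0, 1.0
    while mean_for(lo) > mean:
        lo *= 2.0
    while mean_for(hi) < mean:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < mean:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)

    # Normalization: exp(lam - 1) * sum exp(mu * x_i) = 1.
    shift = max(mu * x for x in values)
    log_norm = shift + math.log(sum(math.exp(mu * x - shift) for x in values))
    lam = 1.0 - log_norm
    p = [math.exp((lam - 1.0) + mu * x) for x in values]
    return p, lam, mu


# Hypothetical data: four equally spaced values with prescribed mean 3.
values = [1.0, 2.0, 3.0, 4.0]
p, lam, mu = max_entropy_density(values, 3.0)
```

When the prescribed mean equals the ordinary average of the values, this procedure returns $\mu = 0$ and the uniform density, as one would expect.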


Example 4 (Hanging chain). A chain is suspended from two thin hooks that are 
16 feet apart on a horizontal line as shown in Fig. 11.3. The chain itself consists of 
20 links of stiff steel. Each link is one foot in length (measured inside). We wish 
to formulate the problem to determine the equilibrium shape of the chain. 

The solution can be found by minimizing the potential energy of the chain. Let 
us number the links consecutively from 1 to 20 starting with the left end. We let 
link $i$ span an $x$ distance of $x_i$ and a $y$ distance of $y_i$. Then $x_i^2 + y_i^2 = 1$. The potential 
energy of a link is its weight times its vertical height (from some reference). The 
potential energy of the chain is the sum of the potential energies of each link. We 
may take the top of the chain as reference and assume that the mass of each link is 
concentrated at its center. Assuming unit weight, the potential energy is then 


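$$\tfrac{1}{2}y_1 + \left(y_1 + \tfrac{1}{2}y_2\right) + \left(y_1 + y_2 + \tfrac{1}{2}y_3\right) + \cdots + \left(y_1 + y_2 + \cdots + y_{n-1} + \tfrac{1}{2}y_n\right) = \sum_{i=1}^n \left(n - i + \tfrac{1}{2}\right)y_i,$$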

where n = 20 in our example. 
The chain is subject to two constraints: The total y displacement is zero, and 
the total x displacement is 16. Thus the equilibrium shape is the solution of 


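$$
\begin{array}{ll}
\text{minimize} & \sum_{i=1}^n \left(n - i + \tfrac{1}{2}\right)y_i \\
\text{subject to} & \sum_{i=1}^n y_i = 0 \\
 & \sum_{i=1}^n \sqrt{1 - y_i^2} = 16.
\end{array}
\tag{15}
$$

Fig. 11.3 A hanging chain 

The first-order necessary conditions are 

$$\left(n - i + \tfrac{1}{2}\right) + \lambda - \frac{\mu y_i}{\sqrt{1 - y_i^2}} = 0 \tag{16}$$

for $i = 1, 2, \ldots, n$. This leads directly to 

$$y_i = -\frac{n - i + \tfrac{1}{2} + \lambda}{\left[\mu^2 + \left(n - i + \tfrac{1}{2} + \lambda\right)^2\right]^{1/2}}. \tag{17}$$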


As in Example 2 the solution is determined once the Lagrange multipliers are 
known. They must be selected so that the solution satisfies the two constraints. 

It is useful to point out that problems of this type may have local minimum 
points. The reader can examine this by considering a short chain of, say, four links 
and v and w configurations. 
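
A minimal numerical sketch of the hanging chain calculation follows. It uses formula (17) for $y_i$ and makes one simplifying assumption that is not taken from the text: by the left-right symmetry of the problem, $\lambda = -n/2$, which makes the $y_i$ antisymmetric about the middle of the chain so that the first constraint holds automatically. The remaining multiplier is then found by bisection on the horizontal span. The function name and the bisection bounds are likewise assumptions.

```python
import math
from typing import List, Tuple


def chain_shape(n: int = 20, span: float = 16.0) -> Tuple[List[float], List[float], float, float]:
    """Link displacements (x_i, y_i) from formula (17), assuming lam = -n/2.

    Returns (x, y, lam, mu), with mu negative so that (16) holds with the
    positive square root.
    """
    if not (0.0 < span < n):
        raise ValueError("span must be positive and shorter than the chain")
    lam = -n / 2.0
    c = [n - i + 0.5 + lam for i in range(1, n + 1)]

    def total_x(mu_abs: float) -> float:
        # x_i = sqrt(1 - y_i^2) = |mu| / sqrt(mu^2 + c_i^2), increasing in |mu|.
        return sum(mu_abs / math.sqrt(mu_abs ** 2 + ci ** 2) for ci in c)

    lo, hi = 0.0, 1.0
    while total_x(hi) < span:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if total_x(mid) < span:
            lo = mid
        else:
            hi = mid
    mu_abs = 0.5 * (lo + hi)

    y = [-ci / math.sqrt(mu_abs ** 2 + ci ** 2) for ci in c]
    x = [math.sqrt(1.0 - yi ** 2) for yi in y]
    return x, y, lam, -mu_abs


x, y, lam, mu = chain_shape()
```

The first link descends ($y_1 < 0$) and the last link rises by the same amount. This sketch finds only the symmetric stationary configuration; it does not explore the other local minima mentioned above.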


Example 5 (Portfolio design). Suppose there are $n$ securities indexed by 
$i = 1, 2, \ldots, n$. Each security $i$ is characterized by its random rate of return $r_i$ which 
has mean value $\bar{r}_i$. Its covariances with the rates of return of other securities are 
$\sigma_{ij}$ for $j = 1, 2, \ldots, n$. The portfolio problem is to allocate total available wealth 
among these $n$ securities, allocating a fraction $w_i$ of wealth to the security $i$. 

The overall rate of return of a portfolio is $r = \sum_{i=1}^n w_ir_i$. This has mean value 
$\bar{r} = \sum_{i=1}^n w_i\bar{r}_i$ and variance $\sigma^2 = \sum_{i,j=1}^n w_i\sigma_{ij}w_j$. 

Markowitz introduced the concept of devising efficient portfolios which for a 
given expected rate of return $\bar{r}$ have minimum possible variance. Such a portfolio 
is the solution to the problem 

$$
\begin{array}{ll}
\displaystyle\min_{w_1, w_2, \ldots, w_n} & \sum_{i,j=1}^n w_i\sigma_{ij}w_j \\
\text{subject to} & \sum_{i=1}^n w_i\bar{r}_i = \bar{r} \\
 & \sum_{i=1}^n w_i = 1.
\end{array}
$$

The second constraint forces the sum of the weights to equal one. There may be 
the further restriction that each $w_i \geq 0$ which would imply that the securities must 
not be shorted (that is, sold short). 

Introducing Lagrange multipliers $\lambda$ and $\mu$ for the two constraints leads easily 
to the $n + 2$ linear equations 

$$
\begin{array}{l}
\sum_{j=1}^n \sigma_{ij}w_j + \lambda\bar{r}_i + \mu = 0 \quad \text{for } i = 1, 2, \ldots, n \\
\sum_{i=1}^n w_i\bar{r}_i = \bar{r} \\
\sum_{i=1}^n w_i = 1
\end{array}
$$

in the $n + 2$ unknowns (the $w_i$’s, $\lambda$ and $\mu$). 
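
The $n + 2$ linear equations can be assembled and solved directly. The sketch below does this for a hypothetical three-security example; the covariance matrix, mean returns, and target return are invented for illustration, and `solve_linear` and `efficient_portfolio` are helpers written for this sketch. Any constant factor in front of the variance only rescales the multipliers, so the weights are unaffected by that choice.

```python
from typing import List, Sequence, Tuple


def solve_linear(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        partial = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - partial) / m[r][r]
    return x


def efficient_portfolio(cov, mean_returns, target) -> Tuple[List[float], float, float]:
    """Solve the n + 2 equations for (w, lam, mu); unknowns ordered w_1..w_n, lam, mu."""
    n = len(mean_returns)
    rows, rhs = [], []
    for i in range(n):
        rows.append(list(cov[i]) + [mean_returns[i], 1.0])  # stationarity in w_i
        rhs.append(0.0)
    rows.append(list(mean_returns) + [0.0, 0.0])            # expected return constraint
    rhs.append(target)
    rows.append([1.0] * n + [0.0, 0.0])                     # weights sum to one
    rhs.append(1.0)
    solution = solve_linear(rows, rhs)
    return solution[:n], solution[n], solution[n + 1]


# Hypothetical data for three securities.
cov = [[0.04, 0.01, 0.00],
       [0.01, 0.09, 0.02],
       [0.00, 0.02, 0.16]]
mean_returns = [0.08, 0.12, 0.15]
weights, lam, mu = efficient_portfolio(cov, mean_returns, target=0.11)
```

The nonnegativity restriction on the weights is ignored here; if some computed weight is negative, the corresponding security is sold short.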




Large-Scale Applications 


The problems that serve as the primary motivation for the methods described in 
this part of the book are actually somewhat different in character than the problems 
represented by the above examples, which by necessity are quite simple. Larger, 
more complex, nonlinear programming problems arise frequently in modern applied 
analysis in a wide variety of disciplines. Indeed, within the past few decades 
nonlinear programming has advanced from a relatively young and primarily analytic 
subject to a substantial general tool for problem solving. 

Large nonlinear programming problems arise in problems of mechanical struc- 
tures, such as determining optimal configurations for bridges, trusses, and so 
forth. Some mechanical designs and configurations that in the past were found by 
solving differential equations are now often found by solving suitable optimization 
problems. An example that is somewhat similar to the hanging chain problem is 
the determination of the shape of a stiff cable suspended between two points and 
supporting a load. 

A wide assortment of large-scale optimization problems arise in a similar way 
as methods for solving partial differential equations. In situations where the under- 
lying continuous variables are defined over a two- or three-dimensional region, 
the continuous region is replaced by a grid consisting of perhaps several thousand 
discrete points. The corresponding discrete approximation to the partial differ- 
ential equation is then solved indirectly by formulating an equivalent optimization 
problem. This approach is used in studies of plasticity, in heat equations, in the 
flow of fluids, in atomic physics, and indeed in almost all branches of physical 
science. 

Problems of optimal control lead to large-scale nonlinear programming 
problems. In these problems a dynamic system, often described by an ordinary 
differential equation, relates control variables to a trajectory of the system state. This 
differential equation, or a discretized version of it, defines one set of constraints. 
The problem is to select the control variables so that the resulting trajectory satisfies 
various additional constraints and minimizes some criterion. An early example of 
such a problem that was solved numerically was the determination of the trajectory 
of a rocket to the moon that required the minimum fuel consumption. 

There are many examples of nonlinear programming in industrial operations 
and business decision making. Many of these are nonlinear versions of the kinds 
of examples that were discussed in the linear programming part of the book. 
Nonlinearities can arise in production functions, cost curves, and, in fact, in almost 
all facets of problem formulation. 

Portfolio analysis, in the context of both stock market investment and evalu- 
ation of a complex project within a firm, is an area where nonlinear programming 
is becoming increasingly useful. These problems can easily have thousands of 
variables. 

In many areas of model building and analysis, optimization formulations are 
increasingly replacing the direct formulation of systems of equations. Thus large 
economic forecasting models often determine equilibrium prices by minimizing 
an objective termed consumer surplus. Physical models are often formulated 
as minimization of energy. Decision problems are formulated as maximizing 
expected utility. Data analysis procedures are based on minimizing an average 
error or maximizing a probability. As the methodology for solution of nonlinear 
programming improves, one can expect that this trend will continue. 


11.5 SECOND-ORDER CONDITIONS 


By an argument analogous to that used for the unconstrained case, we can also derive 
the corresponding second-order conditions for constrained problems. Throughout 
this section it is assumed that $f, \mathbf{h} \in C^2$. 


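Second-Order Necessary Conditions. Suppose that $\mathbf{x}^*$ is a local minimum of 
$f$ subject to $\mathbf{h}(\mathbf{x}) = \mathbf{0}$ and that $\mathbf{x}^*$ is a regular point of these constraints. Then 
there is a $\boldsymbol{\lambda} \in E^m$ such that 

$$\nabla f(\mathbf{x}^*) + \boldsymbol{\lambda}^T\nabla\mathbf{h}(\mathbf{x}^*) = \mathbf{0}. \tag{18}$$

If we denote by $M$ the tangent plane $M = \{\mathbf{y} : \nabla\mathbf{h}(\mathbf{x}^*)\mathbf{y} = \mathbf{0}\}$, then the matrix 

$$\mathbf{L}(\mathbf{x}^*) = \mathbf{F}(\mathbf{x}^*) + \boldsymbol{\lambda}^T\mathbf{H}(\mathbf{x}^*) \tag{19}$$

is positive semidefinite on $M$, that is, $\mathbf{y}^T\mathbf{L}(\mathbf{x}^*)\mathbf{y} \geq 0$ for all $\mathbf{y} \in M$. 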


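Proof. From elementary calculus it is clear that for every twice differentiable 
curve on the constraint surface $S$ through $\mathbf{x}^*$ (with $\mathbf{x}(0) = \mathbf{x}^*$) we have 

$$\frac{d^2}{dt^2}f(\mathbf{x}(t))\Big|_{t=0} \geq 0. \tag{20}$$

By definition 

$$\frac{d^2}{dt^2}f(\mathbf{x}(t))\Big|_{t=0} = \dot{\mathbf{x}}(0)^T\mathbf{F}(\mathbf{x}^*)\dot{\mathbf{x}}(0) + \nabla f(\mathbf{x}^*)\ddot{\mathbf{x}}(0). \tag{21}$$

Furthermore, differentiating the relation $\boldsymbol{\lambda}^T\mathbf{h}(\mathbf{x}(t)) = 0$ twice, we obtain 

$$\dot{\mathbf{x}}(0)^T\boldsymbol{\lambda}^T\mathbf{H}(\mathbf{x}^*)\dot{\mathbf{x}}(0) + \boldsymbol{\lambda}^T\nabla\mathbf{h}(\mathbf{x}^*)\ddot{\mathbf{x}}(0) = 0. \tag{22}$$

Adding (22) to (21), while taking account of (18) and (20), yields the result 

$$\frac{d^2}{dt^2}f(\mathbf{x}(t))\Big|_{t=0} = \dot{\mathbf{x}}(0)^T\mathbf{L}(\mathbf{x}^*)\dot{\mathbf{x}}(0) \geq 0.$$

Since $\dot{\mathbf{x}}(0)$ is arbitrary in $M$, we immediately have the stated conclusion. ∎ 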


The above theorem is our first encounter with the matrix $\mathbf{L} = \mathbf{F} + \boldsymbol{\lambda}^T\mathbf{H}$ which 
is the matrix of second partial derivatives, with respect to $\mathbf{x}$, of the Lagrangian $l$. 
(See Appendix A, Section A.6, for a discussion of the notation $\boldsymbol{\lambda}^T\mathbf{H}$ used here.) 
This matrix is the backbone of the theory of algorithms for constrained problems, 
and it is encountered often in subsequent chapters. 

We next state the corresponding set of sufficient conditions. 


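Second-Order Sufficiency Conditions. Suppose there is a point $\mathbf{x}^*$ satisfying 
$\mathbf{h}(\mathbf{x}^*) = \mathbf{0}$, and a $\boldsymbol{\lambda} \in E^m$ such that 

$$\nabla f(\mathbf{x}^*) + \boldsymbol{\lambda}^T\nabla\mathbf{h}(\mathbf{x}^*) = \mathbf{0}. \tag{23}$$

Suppose also that the matrix $\mathbf{L}(\mathbf{x}^*) = \mathbf{F}(\mathbf{x}^*) + \boldsymbol{\lambda}^T\mathbf{H}(\mathbf{x}^*)$ is positive definite on 
$M = \{\mathbf{y} : \nabla\mathbf{h}(\mathbf{x}^*)\mathbf{y} = \mathbf{0}\}$, that is, for $\mathbf{y} \in M$, $\mathbf{y} \neq \mathbf{0}$ there holds $\mathbf{y}^T\mathbf{L}(\mathbf{x}^*)\mathbf{y} > 0$. 
Then $\mathbf{x}^*$ is a strict local minimum of $f$ subject to $\mathbf{h}(\mathbf{x}) = \mathbf{0}$. 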


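Proof. If $\mathbf{x}^*$ is not a strict relative minimum point, there exists a sequence of 
feasible points $\{\mathbf{y}_k\}$ converging to $\mathbf{x}^*$ such that for each $k$, $f(\mathbf{y}_k) \leq f(\mathbf{x}^*)$. Write 
each $\mathbf{y}_k$ in the form $\mathbf{y}_k = \mathbf{x}^* + \delta_k\mathbf{s}_k$ where $\mathbf{s}_k \in E^n$, $|\mathbf{s}_k| = 1$, and $\delta_k > 0$ for each 
$k$. Clearly, $\delta_k \to 0$ and the sequence $\{\mathbf{s}_k\}$, being bounded, must have a convergent 
subsequence converging to some $\mathbf{s}^*$. For convenience of notation, we assume that 
the sequence $\{\mathbf{s}_k\}$ is itself convergent to $\mathbf{s}^*$. We also have $\mathbf{h}(\mathbf{y}_k) - \mathbf{h}(\mathbf{x}^*) = \mathbf{0}$, and 
dividing by $\delta_k$ and letting $k \to \infty$ we see that $\nabla\mathbf{h}(\mathbf{x}^*)\mathbf{s}^* = \mathbf{0}$. 

Now by Taylor’s theorem, we have for each $j$ 

$$0 = h_j(\mathbf{y}_k) = h_j(\mathbf{x}^*) + \delta_k\nabla h_j(\mathbf{x}^*)\mathbf{s}_k + \frac{\delta_k^2}{2}\mathbf{s}_k^T\nabla^2h_j(\boldsymbol{\eta}_j)\mathbf{s}_k \tag{24}$$

and 

$$0 \geq f(\mathbf{y}_k) - f(\mathbf{x}^*) = \delta_k\nabla f(\mathbf{x}^*)\mathbf{s}_k + \frac{\delta_k^2}{2}\mathbf{s}_k^T\nabla^2f(\boldsymbol{\eta}_0)\mathbf{s}_k, \tag{25}$$

where each $\boldsymbol{\eta}_j$ is a point on the line segment joining $\mathbf{x}^*$ and $\mathbf{y}_k$. Multiplying (24) 
by $\lambda_j$ and adding these to (25) we obtain, on accounting for (23), 

$$0 \geq \frac{\delta_k^2}{2}\mathbf{s}_k^T\left\{\nabla^2f(\boldsymbol{\eta}_0) + \sum_{j=1}^m \lambda_j\nabla^2h_j(\boldsymbol{\eta}_j)\right\}\mathbf{s}_k,$$

which yields a contradiction as $k \to \infty$. ∎ 
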
Example 1. Consider the problem

$$\begin{aligned} \text{maximize} \quad & x_1 x_2 + x_2 x_3 + x_1 x_3 \\ \text{subject to} \quad & x_1 + x_2 + x_3 = 3. \end{aligned}$$

In Example 1 of Section 11.4 it was found that $x_1 = x_2 = x_3 = 1$, $\lambda = -2$ satisfy
the first-order conditions. The matrix $F + \lambda^T H$ becomes in this case

$$L = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix},$$

which itself is neither positive nor negative definite. On the subspace $M = \{y :
y_1 + y_2 + y_3 = 0\}$, however, we note that

$$y^T L y = y_1(y_2 + y_3) + y_2(y_1 + y_3) + y_3(y_1 + y_2) = -(y_1^2 + y_2^2 + y_3^2),$$

and thus $L$ is negative definite on $M$. Therefore, the solution we found is at least a
local maximum.
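
The identity used in this example can be spot-checked numerically. The following is a minimal sketch in plain Python; the helper names `quad_form` and `on_tangent_plane` and the sample vectors are illustrative assumptions, not part of the text. It evaluates $y^T L y$ for a few vectors with $y_1 + y_2 + y_3 = 0$ and compares the result with $-(y_1^2 + y_2^2 + y_3^2)$.

```python
# Matrix L = F + lambda^T H from Example 1 (H = 0 because the constraint is linear).
L = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]


def quad_form(L, y):
    """Return y^T L y for a square matrix L (list of rows) and a vector y."""
    n = len(y)
    return sum(y[i] * L[i][j] * y[j] for i in range(n) for j in range(n))


def on_tangent_plane(y1, y2):
    """Build a vector in M = {y : y1 + y2 + y3 = 0} from two free components."""
    return [y1, y2, -(y1 + y2)]


# Hypothetical sample points on M, chosen only for illustration.
samples = [on_tangent_plane(a, b) for a, b in [(1.0, 0.0), (0.5, -2.0), (3.0, 4.0)]]

for y in samples:
    lhs = quad_form(L, y)            # y^T L y
    rhs = -sum(c * c for c in y)     # -(y1^2 + y2^2 + y3^2)
    print(y, lhs, rhs)
```

Off the subspace the form can be positive (for instance at $y = (1, 1, 1)$), which is why $L$ itself is indefinite even though it is negative definite on $M$.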


11.6 EIGENVALUES IN TANGENT SUBSPACE 


In the last section it was shown that the matrix L restricted to the subspace M 
that is tangent to the constraint surface plays a role in second-order conditions 
entirely analogous to that of the Hessian of the objective function in the uncon- 
strained case. It is perhaps not surprising, in view of this, that the structure of L 
restricted to M also determines rates of convergence of algorithms designed for 
constrained problems in the same way that the structure of the Hessian of the 
objective function does for unconstrained algorithms. Indeed, we shall see that the 
eigenvalues of L restricted to M determine the natural rates of convergence for 
algorithms designed for constrained problems. It is important, therefore, to understand
what these restricted eigenvalues represent. We first determine geometrically
what we mean by the restriction of $L$ to $M$, which we denote by $L_M$. Next we
define the eigenvalues of the operator $L_M$. Finally we indicate how these various
quantities can be computed.

Given any vector $y \in M$, the vector $Ly$ is in $E^n$ but not necessarily in $M$.
We project $Ly$ orthogonally back onto $M$, as shown in Fig. 11.4, and the result
is said to be the restriction of $L$ to $M$ operating on $y$. In this way we obtain a
linear transformation from $M$ to $M$. The transformation is determined somewhat
implicitly, however, since we do not have an explicit matrix representation.

Fig. 11.4 Definition of $L_M$

A vector $y \in M$ is an eigenvector of $L_M$ if there is a real number $\lambda$ such that
$L_M y = \lambda y$; the corresponding $\lambda$ is an eigenvalue of $L_M$. This coincides with the
standard definition. In terms of $L$ we see that $y$ is an eigenvector of $L_M$ if $Ly$ can
be written as the sum of $\lambda y$ and a vector orthogonal to $M$. See Fig. 11.5.

To obtain a matrix representation for $L_M$, it is necessary to introduce a basis
in the subspace $M$. For simplicity it is best to introduce an orthonormal basis, say
$e_1, e_2, \ldots, e_{n-m}$. Define the matrix $E$ to be the $n \times (n-m)$ matrix whose columns
consist of the vectors $e_i$. Then any vector $y$ in $M$ can be written as $y = Ez$ for some
$z \in E^{n-m}$ and, of course, $LEz$ represents the action of $L$ on such a vector. To project
this result back into $M$ and express the result in terms of the basis $e_1, e_2, \ldots, e_{n-m}$,
we merely multiply by $E^T$. Thus $E^T L E z$ is the vector whose components give the
representation in terms of the basis; and, correspondingly, the $(n-m) \times (n-m)$
matrix $E^T L E$ is the matrix representation of $L$ restricted to $M$.

The eigenvalues of $L$ restricted to $M$ can be found by determining the eigenvalues
of $E^T L E$. These eigenvalues are independent of the particular orthonormal
basis $E$.


Fig. 11.5 Eigenvector of $L_M$


Example 1. In the last section we considered

$$L = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$

restricted to $M = \{y : y_1 + y_2 + y_3 = 0\}$. To obtain an explicit matrix representation
on $M$ let us introduce the orthonormal basis:

$$e_1 = \frac{1}{\sqrt{2}}(1, 0, -1), \qquad e_2 = \frac{1}{\sqrt{6}}(1, -2, 1).$$

This gives, upon expansion,

$$E^T L E = \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix},$$

and hence $L$ restricted to $M$ acts like the negative of the identity.
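
The expansion in this example is easy to reproduce by machine. The sketch below, in plain Python, forms $E^T L E$ entry by entry as $e_i^T L e_j$. The helper names (`mat_vec`, `dot`, `restrict`) are our own, and the second basis vector $(1/\sqrt{6})(1, -2, 1)$ is assumed here as the unit vector in $M$ orthogonal to $e_1$. A second orthonormal basis of $M$ is included to illustrate that the result does not depend on the basis chosen.

```python
import math

L = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]

# Orthonormal basis of M = {y : y1 + y2 + y3 = 0}; these are the columns of E.
e1 = [c / math.sqrt(2.0) for c in (1.0, 0.0, -1.0)]
e2 = [c / math.sqrt(6.0) for c in (1.0, -2.0, 1.0)]

# A different orthonormal basis of the same subspace (illustrative choice).
f1 = [c / math.sqrt(2.0) for c in (1.0, -1.0, 0.0)]
f2 = [c / math.sqrt(6.0) for c in (1.0, 1.0, -2.0)]


def mat_vec(A, v):
    """Matrix-vector product A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Inner product u^T v."""
    return sum(a * b for a, b in zip(u, v))


def restrict(L, basis):
    """Return E^T L E, where the columns of E are the given basis vectors."""
    images = [mat_vec(L, e) for e in basis]              # columns of L E
    return [[dot(ei, Lej) for Lej in images] for ei in basis]


L_M = restrict(L, [e1, e2])
L_M_alt = restrict(L, [f1, f2])
print(L_M)
print(L_M_alt)
```

Both representations agree with the negative of the $2 \times 2$ identity up to rounding, so the restricted eigenvalues are $-1$ and $-1$.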
Example 2. Let us consider the problem

$$\begin{aligned} \text{extremize} \quad & x_1 + x_2^2 + x_2 x_3 + 2x_3^2 \\ \text{subject to} \quad & \tfrac{1}{2}(x_1^2 + x_2^2 + x_3^2) = 1. \end{aligned}$$

The first-order necessary conditions are

$$\begin{aligned} 1 + \lambda x_1 &= 0 \\ 2x_2 + x_3 + \lambda x_2 &= 0 \\ x_2 + 4x_3 + \lambda x_3 &= 0. \end{aligned}$$

One solution to this set is easily seen to be $x_1 = 1$, $x_2 = 0$, $x_3 = 0$, $\lambda = -1$. Let us
examine the second-order conditions at this solution point. The Lagrangian matrix
there is

$$L = \begin{bmatrix} -1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 3 \end{bmatrix}$$

and the corresponding subspace $M$ is

$$M = \{y : y_1 = 0\}.$$

In this case $M$ is the subspace spanned by the second two basis vectors in $E^3$ and
hence the restriction of $L$ to $M$ can be found by taking the corresponding submatrix
of $L$. Thus, in this case,

$$E^T L E = \begin{bmatrix} 1 & 1 \\ 1 & 3 \end{bmatrix}.$$

The characteristic polynomial of this matrix is

$$\det \begin{bmatrix} 1 - \lambda & 1 \\ 1 & 3 - \lambda \end{bmatrix} = (1 - \lambda)(3 - \lambda) - 1 = \lambda^2 - 4\lambda + 2.$$

The eigenvalues of $L_M$ are thus $\lambda = 2 \pm \sqrt{2}$, and $L_M$ is positive definite.

Since the $L_M$ matrix is positive definite, we conclude that the point found is a
relative minimum point. This example illustrates that, in general, the restriction of
$L$ to $M$ can be thought of as a submatrix of $L$, although it can be read directly from
the original matrix only if the subspace $M$ is spanned by a subset of the original
basis vectors.


Bordered Hessians 


The above approach for determining the eigenvalues of L projected onto M is quite 
direct and relatively simple. There is another approach, however, that is useful 
in some theoretical arguments and convenient for simple applications. It is based 
on constructing matrices and determinants of order n+ m rather than n —m, so 
dimension is increased. 

Let us first characterize all vectors orthogonal to $M$. $M$ itself is the set of all $x$
satisfying $\nabla h\, x = 0$. A vector $z$ is orthogonal to $M$ if $z^T x = 0$ for all $x \in M$. It is not
hard to show that $z$ is orthogonal to $M$ if and only if $z = \nabla h^T w$ for some $w \in E^m$.
The proof that this is sufficient follows from the calculation $z^T x = w^T \nabla h\, x = 0$.
The proof of necessity follows from the Duality Theorem of Linear Programming
(see Exercise 6).

Now we may explicitly characterize an eigenvector of $L_M$. The vector $x$ is
such an eigenvector if it satisfies these two conditions: (1) $x$ belongs to $M$, and (2)
$Lx = \lambda x + z$, where $z$ is orthogonal to $M$. These conditions are equivalent, in view
of the characterization of $z$, to

$$\begin{aligned} \nabla h\, x &= 0 \\ Lx &= \lambda x + \nabla h^T w. \end{aligned}$$

This can be regarded as a homogeneous system of $n + m$ linear equations in the
unknowns $w$, $x$. It possesses a nonzero solution if and only if the determinant of
the coefficient matrix is zero. Denoting this determinant $p(\lambda)$, we have

$$\det \begin{bmatrix} 0 & \nabla h \\ -\nabla h^T & L - \lambda I \end{bmatrix} \equiv p(\lambda) = 0 \tag{26}$$

as the condition. The function $p(\lambda)$ is a polynomial in $\lambda$ of degree $n - m$. It is, as
we have derived, the characteristic polynomial of $L_M$.

Example 3. Approaching Example 2 in this way we have

$$p(\lambda) \equiv \det \begin{bmatrix} 0 & 1 & 0 & 0 \\ -1 & -(1 + \lambda) & 0 & 0 \\ 0 & 0 & (1 - \lambda) & 1 \\ 0 & 0 & 1 & (3 - \lambda) \end{bmatrix}.$$

This determinant can be evaluated by using Laplace's expansion down the first
column. The result is

$$p(\lambda) = (1 - \lambda)(3 - \lambda) - 1,$$

which is identical to that found earlier.
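
As a check on this calculation, the bordered determinant can be evaluated directly. The sketch below is a plain-Python illustration under stated assumptions: the function names `det` and `p` are ours, the code handles only a single constraint ($m = 1$), and the matrix `L` is the Lagrangian matrix of Example 2 implied by the bordered array above. It compares the determinant with $(1 - \lambda)(3 - \lambda) - 1$ at a few trial values and evaluates it at $\lambda = 2 \pm \sqrt{2}$.

```python
import math


def det(A):
    """Determinant by Laplace expansion down the first column (small matrices only)."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0.0
    for i in range(n):
        if A[i][0] == 0:
            continue
        minor = [row[1:] for k, row in enumerate(A) if k != i]
        total += (-1) ** i * A[i][0] * det(minor)
    return total


def p(lam, grad_h, L):
    """det [[0, grad_h], [-grad_h^T, L - lam*I]] for one constraint (m = 1)."""
    n = len(L)
    rows = [[0.0] + list(grad_h)]
    for i in range(n):
        shifted = [L[i][j] - (lam if i == j else 0.0) for j in range(n)]
        rows.append([-grad_h[i]] + shifted)
    return det(rows)


grad_h = [1.0, 0.0, 0.0]          # gradient of the constraint at x* = (1, 0, 0)
L = [[-1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0],
     [0.0, 1.0, 3.0]]

for lam in (0.0, 1.0, 2.5):
    print(lam, p(lam, grad_h, L), (1.0 - lam) * (3.0 - lam) - 1.0)

roots = [2.0 - math.sqrt(2.0), 2.0 + math.sqrt(2.0)]
print([p(r, grad_h, L) for r in roots])   # both values should be zero up to rounding
```

Recursive Laplace expansion is used only because it mirrors the hand calculation; for matrices of any real size a factorization-based determinant would be the usual choice.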


The above treatment leads one to suspect that it might be possible to extend 
other tests for positive definiteness over the whole space to similar tests in the 
constrained case by working in n+ m dimensions. We present (but do not derive) 
the following classic criterion, which is of this type. It is expressed in terms of the 
bordered Hessian matrix 


$$B = \begin{bmatrix} 0 & \nabla h \\ \nabla h^T & L \end{bmatrix}. \tag{27}$$

(Note that by convention the minus sign in front of $\nabla h^T$ is deleted to make $B$
symmetric; this only introduces sign changes in the conclusions.)


Bordered Hessian Test. The matrix $L$ is positive definite on the subspace
$M = \{x : \nabla h\, x = 0\}$ if and only if the last $n - m$ principal minors of $B$ all have
sign $(-1)^m$.


For the above example we form

$$B = \left[\begin{array}{ccc|c} 0 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ \hline 0 & 0 & 1 & 3 \end{array}\right]$$

and check the last two principal minors: the one indicated by the dashed lines and
the whole determinant. These are $-1$, $-2$, which both have sign $(-1)^1$, and hence
the criterion is satisfied.
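
The test is mechanical enough to automate for small problems. The following plain-Python sketch is an illustration under stated assumptions: the function names are ours, and "principal minors" is read as the leading principal minors of $B$, which is how the example above applies the test. It reproduces the values $-1$ and $-2$ for the example, and for contrast applies the test to the matrix of Example 1, which is negative definite on its tangent subspace.

```python
def det(A):
    """Determinant by Laplace expansion down the first column (small matrices only)."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0.0
    for i in range(n):
        if A[i][0] == 0:
            continue
        minor = [row[1:] for k, row in enumerate(A) if k != i]
        total += (-1) ** i * A[i][0] * det(minor)
    return total


def leading_minor(B, k):
    """Determinant of the upper-left k x k block of B."""
    return det([row[:k] for row in B[:k]])


def bordered_hessian_test(B, n, m):
    """Return (passes, minors): do the last n - m leading minors have sign (-1)^m?"""
    size = n + m
    minors = [leading_minor(B, k) for k in range(size - (n - m) + 1, size + 1)]
    sign = (-1) ** m
    return all(sign * d > 0 for d in minors), minors


# Bordered Hessian for Example 2 (n = 3 variables, m = 1 constraint).
B2 = [[0.0, 1.0, 0.0, 0.0],
      [1.0, -1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, 1.0],
      [0.0, 0.0, 1.0, 3.0]]
print(bordered_hessian_test(B2, n=3, m=1))

# Bordered Hessian for Example 1: L is negative definite on M, so the test fails.
B1 = [[0.0, 1.0, 1.0, 1.0],
      [1.0, 0.0, 1.0, 1.0],
      [1.0, 1.0, 0.0, 1.0],
      [1.0, 1.0, 1.0, 0.0]]
print(bordered_hessian_test(B1, n=3, m=1))
```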


11.7 SENSITIVITY 


The Lagrange multipliers associated with a constrained minimization problem have
an interpretation as prices, similar to the prices associated with constraints in linear
programming. In the nonlinear case the multipliers are associated with the particular
solution point and correspond to incremental or marginal prices, that is, prices
associated with small variations in the constraint requirements.
Suppose the problem

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & h(x) = 0 \end{aligned} \tag{28}$$

has a solution at the point $x^*$ which is a regular point of the constraints. Let $\lambda$ be the
corresponding Lagrange multiplier vector. Now consider the family of problems

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & h(x) = c, \end{aligned} \tag{29}$$

where $c \in E^m$. For a sufficiently small range of $c$ near the zero vector, the problem
will have a solution point $x(c)$ near $x(0) \equiv x^*$. For each of these solutions there is a
corresponding value $f(x(c))$, and this value can be regarded as a function of $c$, the
right-hand side of the constraints. The components of the gradient of this function
can be interpreted as the incremental rate of change in value per unit change in
the constraint requirements. Thus, they are the incremental prices of the constraint
requirements measured in units of the objective. We show below how these prices
are related to the Lagrange multipliers of the problem having $c = 0$.


Sensitivity Theorem. Let $f, h \in C^2$ and consider the family of problems

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & h(x) = c. \end{aligned} \tag{29}$$

Suppose for $c = 0$ there is a local solution $x^*$ that is a regular point and that,
together with its associated Lagrange multiplier vector $\lambda$, satisfies the second-order
sufficiency conditions for a strict local minimum. Then for every $c \in E^m$
in a region containing 0 there is an $x(c)$, depending continuously on $c$, such
that $x(0) = x^*$ and such that $x(c)$ is a local minimum of (29). Furthermore,

$$\left.\nabla_c f(x(c))\right|_{c=0} = -\lambda^T.$$

Proof. Consider the system of equations

$$\nabla f(x) + \lambda^T \nabla h(x) = 0 \tag{30}$$

$$h(x) = c. \tag{31}$$

By hypothesis, there is a solution $x^*$, $\lambda$ to this system when $c = 0$. The Jacobian
matrix of the system at this solution is

$$\begin{bmatrix} L(x^*) & \nabla h(x^*)^T \\ \nabla h(x^*) & 0 \end{bmatrix}.$$

Because by assumption $x^*$ is a regular point and $L(x^*)$ is positive definite on $M$,
it follows that this matrix is nonsingular (see Exercise 11). Thus, by the Implicit
Function Theorem, there is a solution $x(c)$, $\lambda(c)$ to the system which is in fact
continuously differentiable.

By the chain rule we have

$$\left.\nabla_c f(x(c))\right|_{c=0} = \nabla_x f(x^*)\, \nabla_c x(0)$$

and

$$\left.\nabla_c h(x(c))\right|_{c=0} = \nabla_x h(x^*)\, \nabla_c x(0).$$

In view of (31), the second of these is equal to the identity $I$ on $E^m$, while this, in
view of (30), implies that the first can be written

$$\left.\nabla_c f(x(c))\right|_{c=0} = -\lambda^T. \quad ∎$$
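
The price interpretation can be checked on a small instance. The sketch below assumes an illustrative problem of our own choosing: minimize $\tfrac{1}{2}(x_1^2 + x_2^2)$ subject to $x_1 + x_2 - 2 = c$, whose solution is available in closed form. The function name `solve` and the step size are likewise assumptions. It estimates the derivative of the optimal value with respect to $c$ by a central difference and compares it with $-\lambda$.

```python
def solve(c):
    """Closed-form solution of: minimize 0.5*(x1^2 + x2^2) s.t. x1 + x2 - 2 = c.

    By symmetry x1 = x2, and the constraint then gives x1 = x2 = 1 + c/2.
    Stationarity of the Lagrangian, x_i + lam = 0, gives lam = -x1.
    """
    x1 = x2 = 1.0 + c / 2.0
    lam = -x1
    value = 0.5 * (x1 ** 2 + x2 ** 2)
    return (x1, x2), lam, value


eps = 1e-6
_, lam0, _ = solve(0.0)
_, _, f_plus = solve(eps)
_, _, f_minus = solve(-eps)

# Central-difference estimate of d f(x(c)) / dc at c = 0.
slope = (f_plus - f_minus) / (2.0 * eps)
print(slope, -lam0)
```

For this instance the optimal value is $(1 + c/2)^2$, so the slope at $c = 0$ is 1, which matches $-\lambda$ with $\lambda = -1$.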


11.8 INEQUALITY CONSTRAINTS 


We consider now problems of the form 


$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & h(x) = 0 \\ & g(x) \leq 0. \end{aligned} \tag{32}$$

We assume that $f$ and $h$ are as before and that $g$ is a $p$-dimensional function.
Initially, we assume $f, h, g \in C^1$.

There are a number of distinct theories concerning this problem, based on 
various regularity conditions or constraint qualifications, which are directed toward 
obtaining definitive general statements of necessary and sufficient conditions. One 
can by no means pretend that all such results can be obtained as minor extensions 
of the theory for problems having equality constraints only. To date, however, these 
alternative results concerning necessary conditions have been of isolated theoretical 
interest only—for they have not had an influence on the development of algorithms, 
and have not contributed to the theory of algorithms. Their use has been limited to 
small-scale programming problems of two or three variables. We therefore choose 
to emphasize the simplicity of incorporating inequalities rather than the possible 
complexities, not only for ease of presentation and insight, but also because it is 
this viewpoint that forms the basis for work beyond that of obtaining necessary
conditions.


First-Order Necessary Conditions


With the following generalization of our previous definition it is possible to parallel 
the development of necessary conditions for equality constraints. 


Definition. Let $x^*$ be a point satisfying the constraints

$$h(x^*) = 0, \quad g(x^*) \leq 0, \tag{33}$$

and let $J$ be the set of indices $j$ for which $g_j(x^*) = 0$. Then $x^*$ is said to be a
regular point of the constraints (33) if the gradient vectors $\nabla h_i(x^*)$, $\nabla g_j(x^*)$,
$1 \leq i \leq m$, $j \in J$ are linearly independent.


We note that, following the definition of active constraints given in 
Section 11.1, a point x* is a regular point if the gradients of the active constraints 
are linearly independent. Or, equivalently, x* is regular for the constraints if it is 
regular in the sense of the earlier definition for equality constraints applied to the 
active constraints. 


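Karush-Kuhn-Tucker Conditions. Let $x^*$ be a relative minimum point for the
problem

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & h(x) = 0, \quad g(x) \leq 0, \end{aligned} \tag{34}$$

and suppose $x^*$ is a regular point for the constraints. Then there is a vector
$\lambda \in E^m$ and a vector $\mu \in E^p$ with $\mu \geq 0$ such that

$$\nabla f(x^*) + \lambda^T \nabla h(x^*) + \mu^T \nabla g(x^*) = 0 \tag{35}$$

$$\mu^T g(x^*) = 0. \tag{36}$$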


Proof. We note first, since $\mu \geq 0$ and $g(x^*) \leq 0$, (36) is equivalent to the statement
that a component of $\mu$ may be nonzero only if the corresponding constraint is
active. This is a complementary slackness condition, stating that $g_i(x^*) < 0$ implies
$\mu_i = 0$ and $\mu_i > 0$ implies $g_i(x^*) = 0$.

Since $x^*$ is a relative minimum point over the constraint set, it is also a relative
minimum over the subset of that set defined by setting the active constraints to zero.
Thus, for the resulting equality constrained problem defined in a neighborhood of
$x^*$, there are Lagrange multipliers. Therefore, we conclude that (35) holds with
$\mu_j = 0$ if $g_j(x^*) \neq 0$ (and hence (36) also holds).

It remains to be shown that $\mu \geq 0$. Suppose $\mu_k < 0$ for some $k \in J$. Let $S$
and $M$ be the surface and tangent plane, respectively, defined by all other active
constraints at $x^*$. By the regularity assumption, there is a $y$ such that $y \in M$ and
$\nabla g_k(x^*) y < 0$. Let $x(t)$ be a curve on $S$ passing through $x^*$ (at $t = 0$) with $\dot{x}(0) = y$.
Then for small $t \geq 0$, $x(t)$ is feasible, and

$$\left.\frac{df}{dt}(x(t))\right|_{t=0} = \nabla f(x^*) y < 0$$

by (35), which contradicts the minimality of $x^*$. ∎


Example. Consider the problem

$$\begin{aligned} \text{minimize} \quad & 2x_1^2 + 2x_1 x_2 + x_2^2 - 10x_1 - 10x_2 \\ \text{subject to} \quad & x_1^2 + x_2^2 \leq 5 \\ & 3x_1 + x_2 \leq 6. \end{aligned}$$

The first-order necessary conditions, in addition to the constraints, are

$$\begin{aligned} 4x_1 + 2x_2 - 10 + 2\mu_1 x_1 + 3\mu_2 &= 0 \\ 2x_1 + 2x_2 - 10 + 2\mu_1 x_2 + \mu_2 &= 0 \\ \mu_1 \geq 0, \quad \mu_2 &\geq 0 \\ \mu_1(x_1^2 + x_2^2 - 5) &= 0 \\ \mu_2(3x_1 + x_2 - 6) &= 0. \end{aligned}$$

To find a solution we define various combinations of active constraints and check
the signs of the resulting Lagrange multipliers. In this problem we can try setting
none, one, or two constraints active. Assuming the first constraint is active and the
second is inactive yields the equations

$$\begin{aligned} 4x_1 + 2x_2 - 10 + 2\mu_1 x_1 &= 0 \\ 2x_1 + 2x_2 - 10 + 2\mu_1 x_2 &= 0 \\ x_1^2 + x_2^2 &= 5, \end{aligned}$$

which has the solution

$$x_1 = 1, \quad x_2 = 2, \quad \mu_1 = 1.$$

This yields $3x_1 + x_2 = 5$ and hence the second constraint is satisfied. Thus, since
$\mu_1 > 0$, we conclude that this solution satisfies the first-order necessary conditions.
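
The candidate found in this example can be verified by substituting it into each of the first-order conditions. The sketch below is a plain-Python illustration; the function names and the dictionary returned by `kkt_report` are our own conventions, not part of the text. It checks stationarity, feasibility, nonnegativity of the multipliers, and complementary slackness at $x = (1, 2)$, $\mu = (1, 0)$. For contrast it also checks the unconstrained stationary point $(0, 5)$, which corresponds to trying no active constraints.

```python
def grad_f(x1, x2):
    """Gradient of f = 2*x1^2 + 2*x1*x2 + x2^2 - 10*x1 - 10*x2."""
    return (4 * x1 + 2 * x2 - 10, 2 * x1 + 2 * x2 - 10)


def g(x1, x2):
    """Inequality constraints written as g(x) <= 0."""
    return (x1 ** 2 + x2 ** 2 - 5, 3 * x1 + x2 - 6)


def grad_g(x1, x2):
    """Gradients of the two constraint functions, one tuple per constraint."""
    return ((2 * x1, 2 * x2), (3.0, 1.0))


def kkt_report(x, mu, tol=1e-9):
    """Check each first-order condition separately at the pair (x, mu)."""
    gf, gv, gg = grad_f(*x), g(*x), grad_g(*x)
    stationarity = [gf[i] + sum(mu[j] * gg[j][i] for j in range(2)) for i in range(2)]
    return {
        "stationary": all(abs(r) <= tol for r in stationarity),
        "feasible": all(v <= tol for v in gv),
        "mu_nonneg": all(m >= -tol for m in mu),
        "complementary": all(abs(mu[j] * gv[j]) <= tol for j in range(2)),
    }


report = kkt_report((1.0, 2.0), (1.0, 0.0))
print(report)

# With no constraints active, stationarity gives x = (0, 5), which is infeasible.
print(kkt_report((0.0, 5.0), (0.0, 0.0)))
```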


Second-Order Conditions


The second-order conditions, both necessary and sufficient, for problems with 
inequality constraints, are derived essentially by consideration only of the equality 
constrained problem that is implied by the active constraints. The appropriate 
tangent plane for these problems is the plane tangent to the active constraints. 


Second-Order Necessary Conditions. Suppose the functions $f, g, h \in C^2$ and
that $x^*$ is a regular point of the constraints (33). If $x^*$ is a relative minimum
point for problem (32), then there is a $\lambda \in E^m$, $\mu \in E^p$, $\mu \geq 0$ such that (35)
and (36) hold and such that

$$L(x^*) = F(x^*) + \lambda^T H(x^*) + \mu^T G(x^*) \tag{37}$$

is positive semidefinite on the tangent subspace of the active constraints at $x^*$.


Proof. If $x^*$ is a relative minimum point over the constraints (33), it is also a
relative minimum point for the problem with the active constraints taken as equality
constraints. ∎


Just as in the theory of unconstrained minimization, it is possible to formulate 
a converse to the Second-Order Necessary Condition Theorem and thereby obtain a 
Second-Order Sufficiency Condition Theorem. By analogy with the unconstrained 
situation, one can guess that the required hypothesis is that L(x*) be positive definite 
on the tangent plane M. This is indeed sufficient in most situations. However, if 
there are degenerate inequality constraints (that is, active inequality constraints 
having zero as associated Lagrange multiplier), we must require L(x*) to be positive 
definite on a subspace that is larger than M. 


Second-Order Sufficiency Conditions. Let $f, g, h \in C^2$. Sufficient conditions
that a point $x^*$ satisfying (33) be a strict relative minimum point of problem
(32) are that there exist $\lambda \in E^m$, $\mu \in E^p$, such that

$$\mu \geq 0 \tag{38}$$

$$\mu^T g(x^*) = 0 \tag{39}$$

$$\nabla f(x^*) + \lambda^T \nabla h(x^*) + \mu^T \nabla g(x^*) = 0, \tag{40}$$

and the Hessian matrix

$$L(x^*) = F(x^*) + \lambda^T H(x^*) + \mu^T G(x^*) \tag{41}$$

is positive definite on the subspace

$$M' = \{y : \nabla h(x^*) y = 0,\ \nabla g_j(x^*) y = 0 \text{ for all } j \in J\},$$

where

$$J = \{j : g_j(x^*) = 0,\ \mu_j > 0\}.$$

Proof. As in the proof of the corresponding theorem for equality constraints in
Section 11.5, assume that $x^*$ is not a strict relative minimum point; let $\{y_k\}$ be a
sequence of feasible points converging to $x^*$ such that $f(y_k) \leq f(x^*)$, and write each
$y_k$ in the form $y_k = x^* + \delta_k s_k$ with $|s_k| = 1$, $\delta_k > 0$. We may assume that $\delta_k \to 0$
and $s_k \to s^*$. We have $0 \geq \nabla f(x^*) s^*$, and for each $i = 1, \ldots, m$ we have

$$\nabla h_i(x^*) s^* = 0.$$

Also for each active constraint $g_j$ we have $g_j(y_k) - g_j(x^*) \leq 0$, and hence

$$\nabla g_j(x^*) s^* \leq 0.$$

If $\nabla g_j(x^*) s^* = 0$ for all $j \in J$, then the proof goes through just as in Section 11.5.
If $\nabla g_j(x^*) s^* < 0$ for at least one $j \in J$, then

$$0 \geq \nabla f(x^*) s^* = -\lambda^T \nabla h(x^*) s^* - \mu^T \nabla g(x^*) s^* > 0,$$

which is a contradiction. ∎


We note in particular that if all active inequality constraints have strictly 
positive corresponding Lagrange multipliers (no degenerate inequalities), then the 
set J includes all of the active inequalities. In this case the sufficient condition is that 
the Lagrangian be positive definite on M, the tangent plane of active constraints. 


Sensitivity 


The sensitivity result for problems with inequalities is a simple restatement of the 
result for equalities. In this case, a nondegeneracy assumption is introduced so 
that the small variations produced in Lagrange multipliers when the constraints are 
varied will not violate the positivity requirement. 


Sensitivity Theorem. Let $f, g, h \in C^2$ and consider the family of problems

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & h(x) = c \\ & g(x) \leq d. \end{aligned} \tag{42}$$

Suppose that for $c = 0$, $d = 0$, there is a local solution $x^*$ that is a regular
point and that, together with the associated Lagrange multipliers, $\lambda$, $\mu \geq 0$,
satisfies the second-order sufficiency conditions for a strict local minimum.
Assume further that no active inequality constraint is degenerate. Then for
every $(c, d) \in E^{m+p}$ in a region containing $(0, 0)$ there is a solution $x(c, d)$,
depending continuously on $(c, d)$, such that $x(0, 0) = x^*$, and such that $x(c, d)$
is a relative minimum point of (42). Furthermore,

$$\left.\nabla_c f(x(c, d))\right|_{0,0} = -\lambda^T \tag{43}$$

$$\left.\nabla_d f(x(c, d))\right|_{0,0} = -\mu^T. \tag{44}$$


11.9 ZERO-ORDER CONDITIONS AND LAGRANGE
MULTIPLIERS


Zero-order conditions for functionally constrained problems express conditions in 
terms of Lagrange multipliers without the use of derivatives. This theory is not only 
of great practical value, but it also gives new insight into the meaning of Lagrange 
multipliers. Rather than regarding the Lagrange multipliers as separate scalars, 
they are identified as components of a single vector that has a strong geometric 
interpretation. As before, the basic constrained problem is 


$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & h(x) = 0, \quad g(x) \leq 0 \\ & x \in \Omega, \end{aligned} \tag{45}$$

where $x$ is a vector in $E^n$, and $h$ and $g$ are $m$-dimensional and $p$-dimensional
functions, respectively.

In purest form, zero-order conditions require that the functions that define the
objective and the constraints are convex functions and sets. (See Appendix B.)

The vector-valued function $g$ consisting of $p$ individual component functions
$g_1, g_2, \ldots, g_p$ is said to be convex if each of the component functions is convex.

The programming problem (45) above is termed a convex programming
problem if the functions $f$ and $g$ are convex, the function $h$ is affine (that is, linear
plus a constant), and the set $\Omega \subset E^n$ is convex.

Notice that according to Proposition 3, Section 7.4, the set defined by each of
the inequalities $g_j(x) \leq 0$ is convex. This is true also of a set defined by $h_i(x) = 0$.
Since the overall constraint set is the intersection of these and $\Omega$ it follows from
Proposition 1 of Appendix B that this overall constraint set is itself convex. Hence the
problem can be regarded as minimize $f(x)$, $x \in \Omega_1$, where $\Omega_1$ is a convex subset of $\Omega$.

With this view, one could apply the zero-order conditions of Section 7.6 to the
problem with constraint set $\Omega_1$. However, in the case of functional constraints it
is common to keep the structure of the constraints explicit instead of folding them
into an amorphous set.

Although it is possible to derive the zero-order conditions for (45) all at 
once, treating both equality and inequality constraints together, it is notationally 
cumbersome to do so and it may obscure the basic simplicity of the arguments. 
For this reason, we treat equality constraints first, then inequality constraints, and 
finally the combination of the two. 

The equality problem is

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & h(x) = 0 \\ & x \in \Omega. \end{aligned} \tag{46}$$

Letting $Y = E^m$, we have $h(x) \in Y$ for all $x$. For this problem we require a regularity
condition.

Definition. An affine function $h$ is regular with respect to $\Omega$ if the set $C$ in $Y$
defined by $C = \{y : h(x) = y \text{ for some } x \in \Omega\}$ contains an open sphere around
0; that is, $C$ contains a set of the form $\{y : |y| < \varepsilon\}$ for some $\varepsilon > 0$.


This condition means that h(x) can attain 0 and can vary in arbitrary directions 
from 0. 

Notice that this condition is similar to the definition of a regular point in the
context of first-order conditions. If $h$ has continuous derivatives at a point $x^*$ the
earlier regularity condition implies that $\nabla h(x^*)$ is of full rank and the Implicit
Function Theorem (of Appendix A) then guarantees that there is an $\varepsilon > 0$ such that
for any $y$ with $|y - h(x^*)| < \varepsilon$ there is an $x$ such that $h(x) = y$. In other words,
there is an open sphere around $y^* = h(x^*)$ that is attainable. In the present situation
we assume this attainability directly, at the point $0 \in Y$.

Next we introduce the following important construction. 


Definition. The primal function associated with problem (46) is

$$\omega(y) = \inf\{f(x) : h(x) = y,\ x \in \Omega\},$$

defined for all $y \in C$.


Notice that the primal function is defined by varying the right hand side of the 
constraint. The original problem (46) corresponds to w(0). The primal function is 
illustrated in Fig. 11.6. 


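Proposition 1. Suppose $\Omega$ is convex, the function $f$ is convex, and $h$ is affine.
Then the primal function $\omega$ is convex.


Proof. For simplicity of notation we assume that $\Omega$ is the entire space $X$. Then
we observe

$$\begin{aligned} \omega(\alpha y_1 + (1 - \alpha) y_2) &= \inf\{f(x) : h(x) = \alpha y_1 + (1 - \alpha) y_2\} \\ &\leq \inf\{f(x) : x = \alpha x_1 + (1 - \alpha) x_2,\ h(x_1) = y_1,\ h(x_2) = y_2\} \\ &\leq \alpha \inf\{f(x_1) : h(x_1) = y_1\} + (1 - \alpha) \inf\{f(x_2) : h(x_2) = y_2\} \\ &= \alpha\, \omega(y_1) + (1 - \alpha)\, \omega(y_2). \quad ∎ \end{aligned}$$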
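
Proposition 1 can be illustrated numerically on a small instance. The sketch below assumes an illustrative problem of our own choosing, namely $f(x) = \tfrac{1}{2}(x_1^2 + x_2^2)$, $h(x) = x_1 + x_2 - 2$ and $\Omega = \{x : x \geq 0\}$; the function name `primal` and the use of a ternary search are also assumptions made for the illustration. It evaluates the primal function by a one-dimensional search along the constraint line and checks the convexity inequality at the midpoint of a few pairs.

```python
def primal(y, iters=200):
    """omega(y) = inf { 0.5*(x1^2 + x2^2) : x1 + x2 - 2 = y, x1 >= 0, x2 >= 0 }."""
    total = 2.0 + y                      # the constraint forces x1 + x2 = total
    if total < 0:
        raise ValueError("y lies outside C: no nonnegative x attains it")

    def cost(t):
        # Parametrize the feasible segment by x1 = t, x2 = total - t.
        return 0.5 * (t ** 2 + (total - t) ** 2)

    lo, hi = 0.0, total
    for _ in range(iters):               # ternary search; valid because cost is convex in t
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if cost(a) <= cost(b):
            hi = b
        else:
            lo = a
    return cost(0.5 * (lo + hi))


print(primal(0.0))                       # value of the original problem, omega(0)

for y1, y2 in [(-1.0, 1.0), (0.0, 3.0), (-2.0, 2.0)]:
    mid = primal(0.5 * (y1 + y2))
    chord = 0.5 * primal(y1) + 0.5 * primal(y2)
    print(y1, y2, mid, chord)            # convexity requires mid <= chord
```

For this instance the primal function has the closed form $\omega(y) = (2 + y)^2 / 4$ on $y \geq -2$, which is convex, in agreement with the proposition.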


Fig. 11.6 The primal function


We now turn to the derivation of the Lagrange multiplier result for (46).


Proposition 2. Assume that $\Omega \subset E^n$ is convex, $f$ is a convex function on $\Omega$,
and $h$ is an $m$-dimensional affine function on $\Omega$. Assume that $h$ is regular with
respect to $\Omega$. If $x^*$ solves (46), then there is $\lambda \in E^m$ such that $x^*$ solves the
Lagrangian problem

$$\begin{aligned} \text{minimize} \quad & f(x) + \lambda^T h(x) \\ \text{subject to} \quad & x \in \Omega. \end{aligned}$$


Proof. Let $f^* = f(x^*)$. Define the sets $A$ and $B$ in $E^{m+1}$ as

$$A = \{(r, y) : r \geq \omega(y),\ y \in C\}$$

$$B = \{(r, y) : r \leq f^*,\ y = 0\}.$$

$A$ is the epigraph of $\omega$ (see Section 7.6) and $B$ is the vertical line extending below
$f^*$ and aligned with the origin. Both $A$ and $B$ are convex sets. Their only common
point is at $(f^*, 0)$. See Fig. 11.7.

According to the separating hyperplane theorem (Appendix B), there is a
hyperplane separating $A$ and $B$. This hyperplane can be represented by a nonzero
vector in $E^{m+1}$ of the form $(s, \lambda)$, with $\lambda \in E^m$, and a separation constant $c$. The
separation conditions are

$$sr + \lambda^T y \geq c \quad \text{for all } (r, y) \in A$$

$$sr + \lambda^T y \leq c \quad \text{for all } (r, y) \in B.$$

It follows immediately that $s \geq 0$ for otherwise points $(r, 0) \in B$ with $r$ very negative
would violate the second inequality.


Fig. 11.7 The sets $A$ and $B$ and the separating hyperplane

Geometrically, if $s = 0$ the hyperplane would be vertical. We wish to show
that $s \neq 0$, and it is for this purpose that we make use of the regularity condition.
Suppose $s = 0$. Then $\lambda \neq 0$ since both $s$ and $\lambda$ cannot be zero. It follows from the
second separation inequality that $c = 0$ because the hyperplane must include the
point $(f^*, 0)$. Now, as $y$ ranges over a sphere centered at $0 \in C$, the left hand side
of the first separation inequality ranges correspondingly over $\lambda^T y$ which is negative
for some $y$'s. This contradicts the first separation inequality. Thus $s \neq 0$ and thus
in fact $s > 0$. Without loss of generality we may, by rescaling if necessary, assume
that $s = 1$.

Finally, suppose $x \in \Omega$. Then $(f(x), h(x)) \in A$ and $(f(x^*), 0) \in B$. Thus, from
the separation inequality (with $s = 1$) we have

$$f(x) + \lambda^T h(x) \geq f(x^*) = f(x^*) + \lambda^T h(x^*).$$

Hence $x^*$ solves the stated minimization problem. ∎


Example 1 (Best rectangle). Consider the classic problem of finding the rectangle
of maximum area while limiting the perimeter to a length of 4. This can be
formulated as

$$\begin{aligned} \text{minimize} \quad & -x_1 x_2 \\ \text{subject to} \quad & x_1 + x_2 - 2 = 0 \\ & x_1 \geq 0, \quad x_2 \geq 0. \end{aligned}$$

The regularity condition is met because it is possible to make the right hand side of
the functional constraint slightly positive or slightly negative with nonnegative $x_1$
and $x_2$. We know the answer to the problem is $x_1 = x_2 = 1$. The Lagrange multiplier
is $\lambda = 1$. The Lagrangian problem of Proposition 2 is

$$\begin{aligned} \text{minimize} \quad & -x_1 x_2 + 1 \cdot (x_1 + x_2 - 2) \\ \text{subject to} \quad & x_1 \geq 0, \quad x_2 \geq 0. \end{aligned}$$

This can be solved by differentiation to obtain $x_1 = x_2 = 1$.

However the conclusion of the proposition is not satisfied! The value of the
Lagrangian at the solution is $V = -1 + 1 + 1 - 2 = -1$. However, at $x_1 = x_2 = 0$
the value of the Lagrangian is $V' = -2$ which is less than $V$. The Lagrangian is
not minimized at the solution. The proposition breaks down because the objective
function $f(x_1, x_2) = -x_1 x_2$ is not convex.


Example 2 (Best diagonal). As an alternative problem, consider minimizing the
length of the diagonal of a rectangle subject to the perimeter being of length 4. This
problem can be formulated as

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}(x_1^2 + x_2^2) \\ \text{subject to} \quad & x_1 + x_2 - 2 = 0 \\ & x_1 \geq 0, \quad x_2 \geq 0. \end{aligned}$$

In this case the objective function is convex. The solution is $x_1 = x_2 = 1$ and the
Lagrange multiplier is $\lambda = -1$. The Lagrangian problem is

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}(x_1^2 + x_2^2) - 1 \cdot (x_1 + x_2 - 2) \\ \text{subject to} \quad & x_1 \geq 0, \quad x_2 \geq 0. \end{aligned}$$

The value of the Lagrangian at the solution is $V = 1$ which in this case is a minimum
as guaranteed by the proposition. (The value at $x_1 = x_2 = 0$ is $V' = 2$.)
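
The contrast between the two examples can be reproduced by evaluating both Lagrangians directly. The following plain-Python sketch is an illustration only: the function names, the search box $[0, 4]^2$ and the grid spacing of 0.1 are our own assumptions. It evaluates each Lagrangian at $(1, 1)$ and at $(0, 0)$, and then searches the grid for the smallest value.

```python
def lagrangian_rectangle(x1, x2, lam=1.0):
    """Lagrangian of Example 1: -x1*x2 + lam*(x1 + x2 - 2)."""
    return -x1 * x2 + lam * (x1 + x2 - 2.0)


def lagrangian_diagonal(x1, x2, lam=-1.0):
    """Lagrangian of Example 2: 0.5*(x1^2 + x2^2) + lam*(x1 + x2 - 2)."""
    return 0.5 * (x1 ** 2 + x2 ** 2) + lam * (x1 + x2 - 2.0)


# Coarse grid over the box [0, 4] x [0, 4]; the box is an arbitrary illustrative choice.
grid = [i / 10.0 for i in range(0, 41)]


def grid_min(fn):
    """Return (smallest value, x1, x2) of fn over the grid."""
    return min((fn(a, b), a, b) for a in grid for b in grid)


for name, fn in [("rectangle", lagrangian_rectangle), ("diagonal", lagrangian_diagonal)]:
    print(name, fn(1.0, 1.0), fn(0.0, 0.0), grid_min(fn))
```

On this grid the convex Lagrangian of Example 2 attains its smallest value at $(1, 1)$, while the Lagrangian of Example 1 takes values below $V = -1$ away from the solution, consistent with the failure of convexity noted there.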


Inequality constraints 


We outline the parallel results for the inequality constrained problem 


$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & g(x) \leq 0 \\ & x \in \Omega, \end{aligned} \tag{47}$$

where $g$ is a $p$-dimensional function.
We let $Z = E^p$ and define $D \subset Z$ as $D = \{z \in Z : g(x) \leq z \text{ for some } x \in \Omega\}$. The
regularity condition (called the Slater condition) is that there is a $z_1 \in D$ with $z_1 < 0$.
As before we introduce the primal function.


Definition. The primal function associated with problem (47) is

$$\omega(z) = \inf\{f(x) : g(x) \leq z,\ x \in \Omega\}.$$


The primal function is again defined by varying the right hand side of the
constraint function, using the variable $z$. Now the primal function is monotonically
decreasing with $z$, since an increase in $z$ enlarges the constraint region.


Proposition 3. Suppose $\Omega \subset E^n$ is convex and $f$ and $g$ are convex functions.
Then the primal function $\omega$ is also convex.


Proof. The proof parallels that of Proposition 1. One simply substitutes $g(x) \leq z$
for $h(x) = y$ throughout the series of inequalities. ∎


The zero-order necessary Lagrangian conditions are then given by the 
proposition below. 


Proposition 4. Assume $\Omega$ is a convex subset of $E^n$ and that $f$ and $g$ are
convex functions. Assume also that there is a point $x_1 \in \Omega$ such that $g(x_1) < 0$.
Then, if $x^*$ solves (47), there is a vector $\mu \in E^p$ with $\mu \geq 0$ such that $x^*$ solves
the Lagrangian problem

$$\begin{aligned} \text{minimize} \quad & f(x) + \mu^T g(x) \\ \text{subject to} \quad & x \in \Omega. \end{aligned} \tag{48}$$

Furthermore, $\mu^T g(x^*) = 0$.

Proof. Here is the proof outline. Let $f^* = f(x^*)$. In this case define in $E^{p+1}$ the
two sets

$$A = \{(r, z) : r \geq f(x),\ z \geq g(x), \text{ for some } x \in \Omega\}$$

$$B = \{(r, z) : r \leq f^*,\ z \leq 0\}.$$

$A$ is the epigraph of the primal function $\omega$. The set $B$ is the rectangular region at
or to the left of the vertical axis and at or lower than $f^*$. Both $A$ and $B$ are convex.
See Fig. 11.8.


The proof is made by constructing a hyperplane separating $A$ and $B$. The
regularity condition guarantees that this hyperplane is not vertical. ∎


The condition $\mu^T g(x^*) = 0$ is the complementary slackness condition that is
characteristic of necessary conditions for problems with inequality constraints.


Example 4. (Quadratic program). Consider the quadratic program

$$\begin{aligned} \text{minimize} \quad & x^T Q x + c^T x \\ \text{subject to} \quad & a^T x \leq b \\ & x \geq 0. \end{aligned}$$

Let $\Omega = \{x : x \geq 0\}$ and $g(x) = a^T x - b$. Assume that the $n \times n$ matrix $Q$ is positive
definite, in which case the objective function is convex. Assuming that $b > 0$, the
Slater regularity condition is satisfied. Hence there is a Lagrange multiplier $\mu \geq 0$
(a scalar in this case) such that the solution $x^*$ to the quadratic program is also a
solution to

$$\begin{aligned} \text{minimize} \quad & x^T Q x + c^T x + \mu(a^T x - b) \\ \text{subject to} \quad & x \geq 0, \end{aligned}$$

and $\mu(a^T x^* - b) = 0$.

Fig. 11.8 The sets $A$ and $B$ and the separating hyperplane for inequalities


Mixed constraints 


The two previous results can be combined to obtain zero-order conditions for the 
problem 


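$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & h(x) = 0, \quad g(x) \leq 0 \\ & x \in \Omega. \end{aligned} \tag{49}$$


Zero-order Lagrange Theorem. Assume that $\Omega \subset E^n$ is a convex set, $f$ and
$g$ are convex functions of dimension 1 and $p$, respectively, and $h$ is affine of
dimension $m$. Assume also that $h$ satisfies the regularity condition with respect
to $\Omega$ and that there is an $x_1 \in \Omega$ with $h(x_1) = 0$ and $g(x_1) < 0$. Suppose $x^*$
solves (49). Then there are vectors $\lambda \in E^m$ and $\mu \in E^p$ with $\mu \geq 0$ such that
$x^*$ solves the Lagrangian problem

$$\begin{aligned} \text{minimize} \quad & f(x) + \lambda^T h(x) + \mu^T g(x) \\ \text{subject to} \quad & x \in \Omega. \end{aligned} \tag{50}$$

Furthermore, $\mu^T g(x^*) = 0$.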


The convexity requirements of this result are satisfied in many practical 
problems. Indeed convex programming problems are both pervasive and relatively 
well treated by theory and numerical methods. The corresponding theory also 
motivates many approaches to general nonlinear programming problems. In fact, 
it will be apparent that many methods attempt to “convexify” a general nonlinear 
problem either by changing the formulation of the underlying application or by 
introducing devices that temporarily relax as the method progresses. 


Zero-order sufficient conditions 
The sufficiency conditions are very strong and do not require convexity. 


Proposition 5. (Sufficiency Conditions). Suppose $f$ is a real-valued function
on a set $\Omega \subset E^n$. Suppose also that $h$ and $g$ are, respectively, $m$-dimensional
and $p$-dimensional functions on $\Omega$. Finally, suppose there are vectors $x^* \in \Omega$,
$\lambda \in E^m$, and $\mu \in E^p$ with $\mu \geq 0$ such that

$$f(x^*) + \lambda^T h(x^*) + \mu^T g(x^*) \leq f(x) + \lambda^T h(x) + \mu^T g(x)$$

for all $x \in \Omega$. Then $x^*$ solves

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & h(x) = h(x^*) \\ & g(x) \leq g(x^*) \\ & x \in \Omega. \end{aligned}$$


Proof. Suppose there is $x_1 \in \Omega$ with $f(x_1) < f(x^*)$, $h(x_1) = h(x^*)$, and
$g(x_1) \leq g(x^*)$. From $\mu \geq 0$ it is clear that $\mu^T g(x_1) \leq \mu^T g(x^*)$. It follows that
$f(x_1) + \lambda^T h(x_1) + \mu^T g(x_1) < f(x^*) + \lambda^T h(x^*) + \mu^T g(x^*)$, which is a contradiction. ∎


This result suggests that Lagrange multiplier values might be guessed and used 
to define a Lagrangian which is subsequently minimized. This will produce a special 
value of x and special values of the right hand sides of the constraints for which 
this x is optimal. Indeed, this approach is characteristic of duality methods treated 
in Chapter 14. 
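
The idea of guessing a multiplier and then reading off the right-hand side can be sketched on a toy problem. The functions `f`, `h`, and `minimize_lagrangian` and the guessed value `lam_guess` are hypothetical choices made for this illustration only: $f(x) = x_1^2 + x_2^2$, $h(x) = x_1 + x_2$, with no inequality constraints and $\Omega$ the whole plane.

```python
def f(x):
    return x[0] ** 2 + x[1] ** 2

def h(x):
    return x[0] + x[1]

def minimize_lagrangian(lam):
    """Minimize f(x) + lam * h(x) over the plane.

    Stationarity gives 2 * x_i + lam = 0 for each component.
    Returns the minimizer and the constraint value h(x) it produces.
    """
    x = (-lam / 2, -lam / 2)
    return x, h(x)

lam_guess = -2.0                      # guessed multiplier (assumption)
x_star, rhs = minimize_lagrangian(lam_guess)

# Sample other points with the same constraint value h(x) = rhs.
# By the sufficiency result none of them should have a smaller objective.
others = [(t / 10, rhs - t / 10) for t in range(-50, 51)]
no_better = all(f(p) >= f(x_star) - 1e-12 for p in others)
```

Here the guess $\lambda = -2$ yields $x = (1, 1)$, which is therefore optimal for the problem with right-hand side $h(x) = 2$; a different guess would solve the problem for a different right-hand side.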

The theory of this section has an inherent geometric simplicity captured clearly 
by Figs. 11.7 and 11.8. It raises one's level of understanding of Lagrange multipliers
and sets the stage for the theory of convex duality presented in Chapter 14. It is 
certainly possible to jump ahead and read that now. 


11.10 SUMMARY 


Given a minimization problem subject to equality constraints in which all functions 
are smooth, a necessary condition satisfied at a minimum point is that the gradient 
of the objective function is orthogonal to the tangent plane of the constraint surface. 
If the point is regular, then the tangent plane has a simple representation in terms of 
the gradients of the constraint functions, and the above condition can be expressed 
in terms of Lagrange multipliers. 

If the functions have continuous second partial derivatives and Lagrange multi- 
pliers exist, then the Hessian of the Lagrangian restricted to the tangent plane plays 
a role in second-order conditions analogous to that played by the Hessian of the 
objective function in unconstrained problems. Specifically, the restricted Hessian 
must be positive semidefinite at a relative minimum point and, conversely, if it is 
positive definite at a point satisfying the first-order conditions, that point is a strict 
local minimum point. 

Inequalities are treated by determining which of them are active at a solution. 
An active inequality then acts just like an equality, except that its associated 
Lagrange multiplier can never be negative because of the sensitivity interpretation
of the multipliers. 

The necessary conditions for convex problems can be expressed without derivatives,
and these are accordingly termed zero-order conditions. These conditions are
highly geometric in character and explicitly treat the Lagrange multiplier as a vector 
in a space having dimension equal to that of the right-hand-side of the constraints. 
This Lagrange multiplier vector defines a hyperplane that separates the epigraph 
of the primal function from a set of unattainable objective and constraint value 
combinations. 


11.11 EXERCISES 


1. In $E^2$ consider the constraints

$$\begin{aligned} x_1 &\geq 0 \\ x_2 &\geq 0 \\ x_2 - (x_1 - 1)^2 &\leq 0. \end{aligned}$$

Show that the point $x_1 = 1$, $x_2 = 0$ is feasible but is not a regular point.


2. Find the rectangle of given perimeter that has greatest area by solving the first-order 
necessary conditions. Verify that the second-order sufficiency conditions are satisfied. 


3. Verify the second-order conditions for the entropy example of Section 11.4. 


4. A cardboard box for packing quantities of small foam balls is to be manufactured as 
shown in Fig. 11.9. The top, bottom, and front faces must be of double weight (i.e., 
two pieces of cardboard). A problem posed is to find the dimensions of such a box that 
maximize the volume for a given amount of cardboard, equal to 72 sq. ft. 


a) What are the first-order necessary conditions? 
b) Find x, y, z. 
c) Verify the second-order conditions. 


Fig. 11.9 Packing box
5. Define

$$L = \begin{bmatrix} 4 & 3 & 2 \\ 3 & 1 & 1 \\ 2 & 1 & 1 \end{bmatrix}, \qquad h = (1, 1, 0),$$

and let $M$ be the subspace consisting of those points $x = (x_1, x_2, x_3)$ satisfying $h^T x = 0$.


a) Find $L_M$.
b) Find the eigenvalues of $L_M$.
c) Find

$$p(\lambda) = \det \begin{bmatrix} 0 & h^T \\ h & L - I\lambda \end{bmatrix}.$$


d) Apply the bordered Hessian test. 


6. Show that $z^T x = 0$ for all $x$ satisfying $Ax = 0$ if and only if $z = A^T w$ for some $w$. (Hint:
Use the Duality Theorem of Linear Programming.)


7. After a heavy military campaign a certain army requires many new shoes. The quarter- 
master can order three sizes of shoes. Although he does not know precisely how many 
of each size are required, he feels that the demands for the three sizes are independent
and the demand for each size is uniformly distributed between zero and three thousand 
pairs. He wishes to allocate his shoe budget of four thousand dollars among the three 
sizes so as to maximize the expected number of men properly shod. Small shoes cost 
one dollar per pair, medium shoes cost two dollars per pair, and large shoes cost four 
dollars per pair. How many pairs of each size should he order? 


8. Optimal control. A one-dimensional dynamic process is governed by a difference
equation

$$x(k+1) = \phi(x(k), u(k), k)$$

with initial condition $x(0) = x_0$. In this equation the value $x(k)$ is called the state at step
$k$ and $u(k)$ is the control at step $k$. Associated with this system there is an objective
function of the form

$$J = \sum_{k=0}^{N} \psi(x(k), u(k), k).$$

In addition, there is a terminal constraint of the form

$$g(x(N+1)) = 0.$$

The problem is to find the sequence of controls $u(0), u(1), u(2), \ldots, u(N)$ and
corresponding state values to minimize the objective function while satisfying the terminal
constraint. Assuming all functions have continuous first partial derivatives and that the
regularity condition is satisfied, show that associated with an optimal solution there is a
sequence $\lambda(k)$, $k = 0, 1, \ldots, N$ and a $\mu$ such that

$$\begin{aligned}
\lambda(k-1) &= \lambda(k)\phi_x(x(k), u(k), k) + \psi_x(x(k), u(k), k), \quad k = 1, 2, \ldots, N \\
\lambda(N) &= \mu g_x(x(N+1)) \\
\psi_u(x(k), u(k), k) + \lambda(k)\phi_u(x(k), u(k), k) &= 0, \quad k = 0, 1, 2, \ldots, N.
\end{aligned}$$

9. Generalize Exercise 8 to include the case where the state $x(k)$ is an $n$-dimensional vector
and the control $u(k)$ is an $m$-dimensional vector at each stage $k$.


10. An egocentric young man has just inherited a fortune $F$ and is now planning how to
spend it so as to maximize his total lifetime enjoyment. He deduces that if $x(k)$ denotes
his capital at the beginning of year $k$, his holdings will be approximately governed by
the difference equation

$$x(k+1) = \alpha x(k) - u(k), \qquad x(0) = F,$$

where $\alpha > 1$ (with $\alpha - 1$ as the interest rate of investment) and where $u(k)$ is the amount
spent in year $k$. He decides that the enjoyment achieved in year $k$ can be expressed as
$\psi(u(k))$ where $\psi$, his utility function, is a smooth function, and that his total lifetime
enjoyment is

$$J = \sum_{k=0}^{N} \psi(u(k))\beta^k,$$

where the term $\beta^k$ ($0 < \beta < 1$) reflects the notion that future enjoyment is counted
less today. The young man wishes to determine the sequence of expenditures that will
maximize his total enjoyment subject to the condition $x(N+1) = 0$.


a) Find the general optimality relationship for this problem.
b) Find the solution for the special case $\psi(u) = u^{1/2}$.


11. Let $A$ be an $m \times n$ matrix of rank $m$ and let $L$ be an $n \times n$ matrix that is symmetric and
positive definite on the subspace $M = \{x : Ax = 0\}$. Show that the $(n+m) \times (n+m)$
matrix

$$\begin{bmatrix} L & A^T \\ A & 0 \end{bmatrix}$$

is nonsingular.


12. Consider the quadratic program

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2} x^T Q x - b^T x \\ \text{subject to} \quad & Ax = c. \end{aligned}$$

Prove that $x^*$ is a local minimum point if and only if it is a global minimum point. (No
convexity is assumed.)


13. Maximize $14x - x^2 + 6y - y^2 + 7$ subject to $x + y \leq 2$, $x + 2y \leq 3$.


14. In the quadratic program example of Section 11.9, what are more general conditions on
$a$ and $b$ that satisfy the Slater condition?


15. What are the general zero-order Lagrangian conditions for the problem (46) without the
regularity condition? [The coefficient of $f$ will be zero, so there is no real condition.]


16. Show that the problem of finding the rectangle of maximum area with a diagonal of
unit length can be formulated as an unconstrained convex programming problem using
trigonometric functions. [Hint: use variable $\theta$ over the range $0 \leq \theta \leq 45$ degrees.]

Chapter 12 PRIMAL METHODS


In this chapter we initiate the presentation, analysis, and comparison of algorithms 
designed to solve constrained minimization problems. The four chapters that 
consider such problems roughly correspond to the following classification scheme. 
Consider a constrained minimization problem having n variables and m constraints. 
Methods can be devised for solving this problem that work in spaces of dimension 
$n - m$, $n$, $m$, or $n + m$. Each of the following chapters corresponds to methods in
one of these spaces. Thus, the methods in the different chapters represent quite 
different approaches and are founded on different aspects of the theory. However, 
there are also strong interconnections between the methods of the various chapters, 
both in the final form of implementation and in their performance. Indeed, there 
soon emerges the theme that the rates of convergence of most practical algorithms 
are determined by the structure of the Hessian of the Lagrangian much like the 
structure of the Hessian of the objective function determines the rates of conver- 
gence for a wide assortment of methods for unconstrained problems. Thus, although 
the various algorithms of these chapters differ substantially in their motivation, they 
are ultimately found to be governed by a common set of principles. 


12.1. ADVANTAGE OF PRIMAL METHODS 


We consider the question of solving the general nonlinear programming problem 


$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & g(x) \leq 0 \\ & h(x) = 0, \end{aligned} \tag{1}$$


where x is of dimension n, while f, g, and h have dimensions 1, p, and m, respec- 
tively. It is assumed throughout the chapter that all of the functions have continuous 
partial derivatives of order three. Geometrically, we regard the problem as that of 
minimizing $f$ over the region in $E^n$ defined by the constraints.

By a primal method of solution we mean a search method that works on 
the original problem directly by searching through the feasible region for the 
optimal solution. Each point in the process is feasible and the value of the objective 
function constantly decreases. For a problem with $n$ variables and having $m$ equality
constraints only, primal methods work in the feasible space, which has dimension
$n - m$.

Primal methods possess three significant advantages that recommend their use 
as general procedures applicable to almost all nonlinear programming problems. 
First, since each point generated in the search procedure is feasible, if the process 
is terminated before reaching the solution (as practicality almost always dictates 
for nonlinear problems), the terminating point is feasible. Thus this final point is a 
feasible and probably nearly optimal solution to the original problem and therefore 
may represent an acceptable solution to the practical problem that motivated the 
nonlinear program. A second attractive feature of primal methods is that, often, 
it can be guaranteed that if they generate a convergent sequence, the limit point 
of that sequence must be at least a local constrained minimum. Finally, a major 
advantage is that most primal methods do not rely on special problem structure, 
such as convexity, and hence these methods are applicable to general nonlinear 
programming problems. 

Primal methods are not, however, without major disadvantages. They require a 
phase I procedure (see Section 3.5) to obtain an initial feasible point, and they are all 
plagued, particularly for problems with nonlinear constraints, with computational 
difficulties arising from the necessity to remain within the feasible region as the 
method progresses. Some methods can fail to converge for problems with inequality 
constraints unless elaborate precautions are taken. 

The convergence rates of primal methods are competitive with those of other 
methods, and particularly for linear constraints, they are often among the most 
efficient. On balance their general applicability and simplicity place these methods 
in a role of central importance among nonlinear programming algorithms. 


12.2 FEASIBLE DIRECTION METHODS 


The idea of feasible direction methods is to take steps through the feasible region 
of the form 


$$x_{k+1} = x_k + \alpha_k d_k, \tag{2}$$


where $d_k$ is a direction vector and $\alpha_k$ is a nonnegative scalar. The scalar is chosen
to minimize the objective function $f$ with the restriction that the point $x_{k+1}$ and
the line segment joining $x_k$ and $x_{k+1}$ be feasible. Thus, in order that the process
of minimizing with respect to $\alpha$ be nontrivial, an initial segment of the ray
$x_k + \alpha d_k$, $\alpha > 0$ must be contained in the feasible region. This motivates the use of
feasible directions for the directions of search. We recall from Section 7.1 that a
vector $d_k$ is a feasible direction (at $x_k$) if there is an $\bar{\alpha} > 0$ such that $x_k + \alpha d_k$
is feasible for all $\alpha$, $0 \leq \alpha \leq \bar{\alpha}$. A feasible direction method can be considered
as a natural extension of our unconstrained descent methods. Each step is the
composition of selecting a feasible direction and a constrained line search.
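
For linear inequality constraints the length of the feasible segment along a direction can be found with a simple ratio test. The sketch below is an illustration under that assumption; the function name `max_feasible_step` and the small data set are hypothetical and not taken from the text. A direction is feasible at $x$ exactly when the returned bound is positive or absent.

```python
from fractions import Fraction

def max_feasible_step(A, b, x, d):
    """Largest alpha >= 0 such that A (x + alpha d) <= b, assuming A x <= b.

    Returns None when the ray never leaves the feasible region.
    """
    best = None
    for a_i, b_i in zip(A, b):
        slope = sum(Fraction(u) * v for u, v in zip(a_i, d))      # a_i^T d
        slack = b_i - sum(Fraction(u) * v for u, v in zip(a_i, x))  # b_i - a_i^T x
        if slope > 0:                     # only these constraints can be hit
            step = slack / slope
            if best is None or step < best:
                best = step
    return best

# Hypothetical region: x1 + x2 <= 4, x1 >= 0, x2 >= 0 (written as -x_i <= 0).
A = [[1, 1], [-1, 0], [0, -1]]
b = [4, 0, 0]
step_right = max_feasible_step(A, b, [1, 1], [1, 0])     # blocked by x1 + x2 <= 4
step_down = max_feasible_step(A, b, [1, 1], [-1, -1])    # blocked by the bounds
step_blocked = max_feasible_step(A, b, [0, 1], [-1, 0])  # x1 >= 0 already active
```

In the last call the bound is zero, so $(-1, 0)$ is not a feasible direction at the point $(0, 1)$.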

Example 1 (Simplified Zoutendijk method). One of the earliest proposals for a
feasible direction method uses a linear programming subproblem. Consider the
problem with linear inequality constraints

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & a_i^T x \leq b_i, \quad i = 1, \ldots, m. \end{aligned} \tag{3}$$

Given a feasible point, $x_k$, let $I$ be the set of indices representing active constraints,
that is, $a_i^T x_k = b_i$ for $i \in I$. The direction vector $d_k$ is then chosen as the solution to
the linear program

$$\begin{aligned} \text{minimize} \quad & \nabla f(x_k) d \\ \text{subject to} \quad & a_i^T d \leq 0, \quad i \in I \\ & \sum_{i=1}^{n} |d_i| = 1, \end{aligned} \tag{4}$$

where $d = (d_1, d_2, \ldots, d_n)$. The last equation is a normalizing equation that ensures
a bounded solution. (Even though it is written in terms of absolute values, the
problem can be converted to a linear program; see Exercise 1.) The other constraints
assure that vectors of the form $x_k + \alpha d_k$ will be feasible for sufficiently small $\alpha > 0$,
and subject to these conditions, $d$ is chosen to line up as closely as possible with the
negative gradient of $f$. In some sense this will result in the locally best direction in
which to proceed. The overall procedure progresses by generating feasible directions
in this manner, and moving along them to decrease the objective.

There are two major shortcomings of feasible direction methods that require 
that they be modified in most cases. The first shortcoming is that for general 
problems there may not exist any feasible directions. If, for example, a problem had 
nonlinear equality constraints, we might find ourselves in the situation depicted by 
Fig. 12.1 where no straight line from $x_k$ has a feasible segment. For such problems
it is necessary either to relax our requirement of feasibility by allowing points to 
deviate slightly from the constraint surface or to introduce the concept of moving 
along curves rather than straight lines. 


Fig. 12.1 No feasible direction

A second shortcoming is that in simplest form most feasible direction methods
are not globally convergent. They are subject to jamming (sometimes referred to as 
zigzagging) where the sequence of points generated by the process converges to a 
point that is not even a constrained local minimum point. This phenomenon can be 
explained by the fact that the algorithmic map is not closed. 

The algorithm associated with a method of feasible directions can generally be 
written as the composition of two maps A = MD, where D is a map that selects a 
direction and M is the map corresponding to constrained minimization in the given 
direction. (We use the new notation M rather than S, since now the line search 
is constrained to the feasible region.) Unfortunately, it is quite often the case in 
feasible direction methods that M and D are not both closed. 


Example 2 (M not closed). Consider the region shown in Fig. 12.2 together with
the sequence of feasible points $\{x_k\}$ and feasible directions $\{d_k\}$. We have $x_k \to x^*$
and $d_k \to d^*$. Also from the diagram and the direction of $\nabla f^T$ it is clear that

$$M(x_k, d_k) = x_{k+1} \to x^*, \qquad M(x^*, d^*) = y \neq x^*.$$

Thus M is not closed at $(x^*, d^*)$.


Example 3 (D not closed). In the simplified method presented in Example 1, the
feasible direction selection map D is not closed. This can be seen from Fig. 12.3
where the directions are shown for a convergent sequence of points, and the limiting
direction is not equal to the direction at the limiting point. Basically, nonclosedness
is caused in this case by the fact that the method used for generating the feasible
direction changes suddenly when an additional constraint becomes active.


Fig. 12.2 Example of M not closed


Fig. 12.3 Example of D not closed


It is possible to develop feasible direction algorithms that are closed and hence 
not subject to jamming. Some procedures for doing so are discussed in Exercises 4 to 
7. However, such methods can become somewhat complicated. A simpler approach 
for treating inequality constraints is to use an active set method, as discussed in the 
next section. 


12.3. ACTIVE SET METHODS 


The idea underlying active set methods is to partition inequality constraints into 
two groups: those that are to be treated as active and those that are to be treated as 
inactive. The constraints treated as inactive are essentially ignored. 

Consider the constrained problem 


$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & g(x) \leq 0, \end{aligned} \tag{5}$$

which for simplicity of the current discussion is taken to have inequality constraints 
only. The inclusion of equality constraints is straightforward, as will become clear. 
The necessary conditions for this problem are 


$$\begin{aligned} \nabla f(x) + \lambda^T \nabla g(x) &= 0 \\ g(x) &\leq 0 \\ \lambda^T g(x) &= 0 \\ \lambda &\geq 0. \end{aligned} \tag{6}$$

(See Section 11.8.) These conditions can be expressed in a somewhat simpler form
in terms of the set of active constraints. Let $A$ denote the index set of active
constraints; that is, $A$ is the set of $i$ such that $g_i(x^*) = 0$. Then the necessary
conditions (6) become

$$\begin{aligned} \nabla f(x) + \sum_{i \in A} \lambda_i \nabla g_i(x) &= 0 \\ g_i(x) &= 0, \quad i \in A \\ g_i(x) &< 0, \quad i \notin A \\ \lambda_i &\geq 0, \quad i \in A \\ \lambda_i &= 0, \quad i \notin A. \end{aligned} \tag{7}$$


The first two lines of these conditions correspond identically to the necessary 
conditions of the equality constrained problem obtained by requiring the active 
constraints to be zero. The next line guarantees that the inactive constraints are 
satisfied, and the sign requirement of the Lagrange multipliers guarantees that every 
constraint that is active should be active. 

It is clear that if the active set were known, the original problem could be 
replaced by the corresponding problem having equality constraints only. Alter- 
natively, suppose an active set was guessed and the corresponding equality 
constrained problem solved. Then if the other constraints were satisfied and 
the Lagrange multipliers turned out to be nonnegative, that solution would be 
correct. 

The idea of active set methods is to define at each step, or at each phase, of 
an algorithm a set of constraints, termed the working set, that is to be treated as 
the active set. The working set is chosen to be a subset of the constraints that are 
actually active at the current point, and hence the current point is feasible for the 
working set. The algorithm then proceeds to move on the surface defined by the 
working set of constraints to an improved point. At this new point the working 
set may be changed. Overall, then, an active set method consists of the following 
components: (1) determination of a current working set that is a subset of the current 
active constraints, and (2) movement on the surface defined by the working set to 
an improved point. 

There are several methods for determining the movement on the surface 
defined by the working set. (This surface will be called the working surface.) 
The most important of these methods are discussed in the following sections. 
The direction of movement is generally determined by first-order or second-order 
approximations of the functions at the current point in a manner similar to that 
for unconstrained problems. The asymptotic convergence properties of active set 
methods depend entirely on the procedure for moving on the working surface, 
since near the solution the working set is generally equal to the correct active set, 
and the process simply moves successively on the surface determined by those 
constraints. 

Changes in Working Set


Suppose that for a given working set $W$ the problem with equality constraints

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & g_i(x) = 0, \quad i \in W \end{aligned}$$

is solved yielding the point $x_W$ that satisfies $g_i(x_W) < 0$, $i \notin W$. This point satisfies
the necessary conditions

$$\nabla f(x_W) + \sum_{i \in W} \lambda_i \nabla g_i(x_W) = 0. \tag{8}$$

If $\lambda_i \geq 0$ for all $i \in W$, then the point $x_W$ is a local solution to the original problem.
If, on the other hand, there is an $i \in W$ such that $\lambda_i < 0$, then the objective can
be decreased by relaxing constraint $i$. This follows directly from the sensitivity
interpretation of Lagrange multipliers, since a small decrease in the constraint value
from 0 to $-c$ would lead to a change in the objective function of $\lambda_i c$, which is
negative. Thus, by dropping the constraint $i$ from the working set, an improved
solution can be obtained. The Lagrange multiplier of a problem thereby serves as
an indication of which constraints should be dropped from the working set. This is
illustrated in Fig. 12.4. In the figure, $x$ is the minimum point of $f$ on the surface (a
curve in this case) defined by $g_1(x) = 0$. However, it is clear that the corresponding
Lagrange multiplier $\lambda_1$ is negative, implying that $g_1$ should be dropped. Since $\nabla f$
points outside, it is clear that a movement toward the interior of the feasible region
will indeed decrease $f$.

During the course of minimizing f(x) over the working surface, it is necessary 
to monitor the values of the other constraints to be sure that they are not violated, 
since all points defined by the algorithm must be feasible. It often happens that 
while moving on the working surface a new constraint boundary is encountered. It 
is then convenient to add this constraint to the working set, proceeding on a surface 
of one lower dimension than before. This is illustrated in Fig. 12.5. In the figure 
the working constraint is just $g_1 = 0$ for $x_1, x_2, x_3$. A boundary is encountered at
the next step, and therefore $g_2 = 0$ is adjoined to the set of working constraints.


Fig. 12.4 Constraint to be dropped


Fig. 12.5 Constraint added to working set


A complete active set strategy for systematically dropping and adding 
constraints can be developed by combining the above two ideas. One starts with a 
given working set and begins minimizing over the corresponding working surface. 
If new constraint boundaries are encountered, they may be added to the working 
set, but no constraints are dropped from the working set. Finally, a point is obtained 
that minimizes f with respect to the current working set of constraints. The corre- 
sponding Lagrange multipliers are determined, and if they are all nonnegative the 
solution is optimal. Otherwise, one or more constraints with negative Lagrange 
multipliers are dropped from the working set. The procedure is reinitiated with this 
new working set, and f will strictly decrease on the next step. 

An active set method built upon this basic active set strategy requires that a 
procedure be defined for minimization on a working surface that allows constraints 
to be added to the working set when they are encountered, and that, after dropping 
a constraint, ensures that the objective is strictly decreased. Such a method is
guaranteed to converge to the optimal solution, as shown below. 


Active Set Theorem. Suppose that for every subset $W$ of the constraint indices,
the constrained problem

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & g_i(x) = 0, \quad i \in W \end{aligned} \tag{9}$$

is well-defined with a unique nondegenerate solution (that is, for all $i \in W$,
$\lambda_i \neq 0$). Then the sequence of points generated by the basic active set strategy
converges to the solution of the inequality constrained problem (5).


Proof. After the solution corresponding to one working set is found, a decrease
in the objective is made, and hence it is not possible to return to that working set.
Since there are only a finite number of working sets, the process must terminate. ∎


The difficulty with the above procedure is that several problems with incorrect 
active sets must be solved. Furthermore, the solutions to these intermediate problems 
must, in general, be exact global minimum points in order to determine the correct 
sign of the Lagrange multipliers and to assure that during the subsequent descent
process the current working surface is not encountered again. 

In practice one deviates from the ideal basic method outlined above by dropping 
constraints using various criteria before an exact minimum on the working surface 
is found. Convergence cannot be guaranteed for many of these methods, and indeed 
they are subject to zigzagging (or jamming) where the working set changes an 
infinite number of times. However, experience has shown that zigzagging is very 
rare for many algorithms, and in practice the active set strategy with various
refinements is often very effective.

It is clear that a fundamental component of an active set method is the algorithm 
for solving a problem with equality constraints only, that is, for minimizing on the 
working surface. Such methods and their analyses are presented in the following 
sections. 


12.4 THE GRADIENT PROJECTION METHOD 


The gradient projection method is motivated by the ordinary method of steepest 
descent for unconstrained problems. The negative gradient is projected onto the 
working surface in order to define the direction of movement. We present it here 
in a simplified form that is based on a pure active set strategy. 


Linear Constraints 


Consider first problems of the form 


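$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & a_i^T x \leq b_i, \quad i \in I_1 \\ & a_i^T x = b_i, \quad i \in I_2, \end{aligned} \tag{10}$$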


having linear equalities and inequalities. 

A feasible solution to the constraints, if one exists, can be found by application 
of the phase I procedure of linear programming; so we shall always assume that 
our descent process is initiated at such a feasible point. At a given feasible point x 
there will be a certain number $q$ of active constraints satisfying $a_i^T x = b_i$ and some
inactive constraints $a_i^T x < b_i$. We initially take the working set $W(x)$ to be the set
of active constraints.

At the feasible point $x$ we seek a feasible direction vector $d$ satisfying
$\nabla f(x) d < 0$, so that movement in the direction $d$ will cause a decrease in the
function $f$. Initially, we consider directions satisfying $a_i^T d = 0$, $i \in W(x)$ so that
all working constraints remain active. This requirement amounts to requiring that 
the direction vector d lie in the tangent subspace M defined by the working set of 
constraints. The particular direction vector that we shall use is the projection of the 
negative gradient onto this subspace. 

To compute this projection let $A_q$ be defined as composed of the rows of
working constraints. Assuming regularity of the constraints, as we shall always
assume, $A_q$ will be a $q \times n$ matrix of rank $q < n$. The tangent subspace $M$ in
which $d$ must lie is the subspace of vectors satisfying $A_q d = 0$. This means that
the subspace $N$ consisting of the vectors making up the rows of $A_q$ (that is, all
vectors of the form $A_q^T \lambda$ for $\lambda \in E^q$) is orthogonal to $M$. Indeed, any vector can be
written as the sum of vectors from each of these two complementary subspaces. In
particular, the negative gradient vector $-g_k$ can be written

$$-g_k = d_k + A_q^T \lambda_k, \tag{11}$$

where $d_k \in M$ and $\lambda_k \in E^q$. We may solve for $\lambda_k$ through the requirement that
$A_q d_k = 0$. Thus

$$A_q d_k = -A_q g_k - (A_q A_q^T)\lambda_k = 0, \tag{12}$$

which leads to

$$\lambda_k = -(A_q A_q^T)^{-1} A_q g_k \tag{13}$$

and

$$d_k = -[I - A_q^T (A_q A_q^T)^{-1} A_q] g_k = -P_k g_k. \tag{14}$$

The matrix

$$P_k = [I - A_q^T (A_q A_q^T)^{-1} A_q] \tag{15}$$

is called the projection matrix corresponding to the subspace $M$. Action by it on
any vector yields the projection of that vector onto $M$. See Exercises 8 and 9 for
other derivations of this result.
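
As a quick check on (15), the following short derivation (added as an illustration; it uses only the definition of $P_k$ and the assumption that $A_q$ has rank $q$, so that $A_q A_q^T$ is invertible) confirms that $P_k$ is symmetric, annihilates the rows of $A_q$, and is idempotent, which are the defining properties of an orthogonal projection onto $M$.

```latex
\begin{aligned}
P_k^T &= I - A_q^T (A_q A_q^T)^{-1} A_q = P_k, \\
A_q P_k &= A_q - (A_q A_q^T)(A_q A_q^T)^{-1} A_q = 0, \\
P_k^2 &= P_k - A_q^T (A_q A_q^T)^{-1} (A_q P_k) = P_k .
\end{aligned}
```

The second line shows that $d_k = -P_k g_k$ always satisfies $A_q d_k = 0$, so every working constraint remains active along $d_k$.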

We easily check that if $d_k \neq 0$, then it is a direction of descent. Since $g_k + d_k$
is orthogonal to $d_k$, we have

$$g_k^T d_k = (g_k^T + d_k^T - d_k^T) d_k = -|d_k|^2.$$

Thus if $d_k$ as computed from (14) turns out to be nonzero, it is a feasible direction
of descent on the working surface.

We next consider selection of the step size. As $\alpha$ is increased from zero, the
point $x + \alpha d$ will initially remain feasible and the corresponding value of $f$ will
decrease. We find the length of the feasible segment of the line emanating from $x$
and then minimize $f$ over this segment. If the minimum occurs at the endpoint, a
new constraint will become active and will be added to the working set.

Next, consider the possibility that the projected negative gradient is zero. We
have in that case

$$\nabla f(x_k) + \lambda_k^T A_q = 0, \tag{16}$$

and the point $x_k$ satisfies the necessary conditions for a minimum on the working
surface. If the components of $\lambda_k$ corresponding to the active inequalities are all
nonnegative, then this fact together with (16) implies that the Karush-Kuhn-Tucker
conditions for the original problem are satisfied at $x_k$ and the process terminates. In
this case the $\lambda_k$ found by projecting the negative gradient is essentially the Lagrange
multiplier vector for the original problem (except that zero-valued multipliers must
be appended for the inactive constraints).

If, however, at least one of those components of $\lambda_k$ is negative, it is possible,
by relaxing the corresponding inequality, to move in a new direction to an improved
point. Suppose that $\lambda_{jk}$, the $j$th component of $\lambda_k$, is negative and the indexing
is arranged so that the corresponding constraint is the inequality $a_j^T x \leq b_j$. We
determine the new direction vector by relaxing the $j$th constraint and projecting
the negative gradient onto the subspace determined by the remaining $q - 1$ active
constraints. Let $\bar{A}_q$ denote the matrix $A_q$ with row $a_j$ deleted. We have for some $\bar{\lambda}_k$,

$$-g_k = A_q^T \lambda_k \tag{17}$$

$$-g_k = \bar{d}_k + \bar{A}_q^T \bar{\lambda}_k, \tag{18}$$

where $\bar{d}_k$ is the projection of $-g_k$ using $\bar{A}_q$. It is immediately clear that $\bar{d}_k \neq 0$, since
otherwise (18) would be a special case of (17) with $\lambda_{jk} = 0$ which is impossible,
since the rows of $A_q$ are linearly independent. From our previous work we know
that $g_k^T \bar{d}_k < 0$. Multiplying the transpose of (17) by $\bar{d}_k$ and using $\bar{A}_q \bar{d}_k = 0$ we
obtain

$$0 > g_k^T \bar{d}_k = -\lambda_{jk} a_j^T \bar{d}_k. \tag{19}$$

Since $\lambda_{jk} < 0$ we conclude that $a_j^T \bar{d}_k < 0$. Thus the vector $\bar{d}_k$ is not only a direction
of descent, but it is a feasible direction, since $a_i^T \bar{d}_k = 0$, $i \in W(x_k)$, $i \neq j$, and
$a_j^T \bar{d}_k < 0$. Hence $j$ can be dropped from $W(x_k)$.

In summary, one step of the algorithm is as follows: Given a feasible point $x$

1. Find the subspace of active constraints $M$, and form $A_q$, $W(x)$.
2. Calculate $P = I - A_q^T (A_q A_q^T)^{-1} A_q$ and $d = -P \nabla f(x)^T$.
3. If $d \neq 0$, find $\alpha_1$ and $\alpha_2$ achieving, respectively,

$$\max \{\alpha : x + \alpha d \text{ is feasible}\}$$

$$\min \{f(x + \alpha d) : 0 \leq \alpha \leq \alpha_1\}.$$

Set $x$ to $x + \alpha_2 d$ and return to (1).
4. If $d = 0$, find $\lambda = -(A_q A_q^T)^{-1} A_q \nabla f(x)^T$.

a) If $\lambda_j \geq 0$, for all $j$ corresponding to active inequalities, stop; $x$ satisfies the
Karush-Kuhn-Tucker conditions.

b) Otherwise, delete the row from $A_q$ corresponding to the inequality with the
most negative component of $\lambda$ (and drop the corresponding constraint from
$W(x)$) and return to (2).
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The projection matrix need not be recomputed in its entirety at each new 
point. Since the set of active constraints in the working set changes by at most one 
constraint at a time, it is possible to calculate one required projection matrix from 
the previous one by an updating procedure. (See Exercise 11.) This is an important 
feature of the gradient projection method and greatly reduces the computation 
required at each step. 
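
Steps 2 and 4 of the summary reduce to a few linear-algebra operations. The following is a minimal sketch in Python, assuming the rows of the active-constraint matrix are linearly independent. The helper names `solve` and `projected_direction` and the small data set are illustrative and do not come from the text; exact rational arithmetic is used only to keep the example free of rounding. The sketch recomputes everything from scratch and does not attempt the updating procedure mentioned above.

```python
from fractions import Fraction


def solve(M, b):
    """Solve M u = b by Gauss-Jordan elimination in exact arithmetic."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(M)]
    for col in range(n):
        # Pick a row with a nonzero pivot (one exists when M is nonsingular).
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def projected_direction(A, g):
    """Return (d, lam) with lam = -(A A^T)^{-1} A g and d = -(g + A^T lam) = -P g."""
    AAt = [[sum(a * b for a, b in zip(r1, r2)) for r2 in A] for r1 in A]
    Ag = [sum(a * gi for a, gi in zip(row, g)) for row in A]
    lam = [-u for u in solve(AAt, Ag)]
    d = [-(g[j] + sum(A[i][j] * lam[i] for i in range(len(A)))) for j in range(len(g))]
    return d, lam


# Hypothetical data: one active constraint with row (1, 1, 1) and gradient g.
A_q = [[1, 1, 1]]
g = [1, 2, 3]
d, lam = projected_direction(A_q, g)
print(d, lam)
```

Here the direction is nonzero, so step 3 would apply. Only when the returned direction is zero do the signs of the multipliers decide between stopping (step 4a) and dropping a constraint (step 4b).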


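Example. Consider the problem 

$$\begin{aligned}
\text{minimize}\quad & x_1^2 + x_2^2 + x_3^2 + x_4^2 - 2x_1 - 3x_4 \\
\text{subject to}\quad & 2x_1 + x_2 + x_3 + 4x_4 = 7 \\
& x_1 + x_2 + 2x_3 + x_4 = 6 \\
& x_i \ge 0,\quad i = 1, 2, 3, 4.
\end{aligned} \tag{20}$$

Suppose that given the feasible point $\mathbf{x} = (2, 2, 1, 0)$ we wish to find the direction 
of the projected negative gradient. The active constraints are the two equalities and 
the inequality $x_4 \ge 0$. Thus 

$$\mathbf{A}_q = \begin{bmatrix} 2 & 1 & 1 & 4 \\ 1 & 1 & 2 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \tag{21}$$

and hence 

$$\mathbf{A}_q\mathbf{A}_q^T = \begin{bmatrix} 22 & 9 & 4 \\ 9 & 7 & 1 \\ 4 & 1 & 1 \end{bmatrix}.$$

After considerable calculation we then find 

$$(\mathbf{A}_q\mathbf{A}_q^T)^{-1} = \frac{1}{11}\begin{bmatrix} 6 & -5 & -19 \\ -5 & 6 & 14 \\ -19 & 14 & 73 \end{bmatrix}$$

and finally 

$$\mathbf{P} = \frac{1}{11}\begin{bmatrix} 1 & -3 & 1 & 0 \\ -3 & 9 & -3 & 0 \\ 1 & -3 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}. \tag{22}$$

The gradient at the point $(2, 2, 1, 0)$ is $\mathbf{g} = (2, 4, 2, -3)$ and hence we find 

$$\mathbf{d} = -\mathbf{P}\mathbf{g} = \frac{1}{11}(8, -24, 8, 0),$$

or normalizing by 8/11 

$$\mathbf{d} = (1, -3, 1, 0). \tag{23}$$

It can be easily verified that movement in this direction does not violate the 
constraints. 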
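
Step 3 of the algorithm can be carried through for this example. The sketch below, in Python with exact fractions, recomputes $\mathbf{d} = -\mathbf{P}\mathbf{g}$ from the projection matrix and gradient of the example, applies the ratio test for the bounds $\mathbf{x} \ge \mathbf{0}$, and performs the line search. The line search has a closed form here only because $f$ is quadratic with Hessian $2\mathbf{I}$; the variable names are illustrative.

```python
from fractions import Fraction

x = [Fraction(v) for v in (2, 2, 1, 0)]
g = [Fraction(v) for v in (2, 4, 2, -3)]          # gradient of f at x
P = [[Fraction(v, 11) for v in row] for row in
     [[1, -3, 1, 0], [-3, 9, -3, 0], [1, -3, 1, 0], [0, 0, 0, 0]]]

# Projected negative gradient d = -P g.
d = [-sum(p * gi for p, gi in zip(row, g)) for row in P]

# Step 3, first part: largest alpha keeping x + alpha*d >= 0
# (the equalities stay satisfied because A_q d = 0).
alpha1 = min(-xi / di for xi, di in zip(x, d) if di < 0)

# Step 3, second part: f(x + alpha*d) = f(x) + alpha*(g.d) + alpha^2*(d.d),
# so the unconstrained minimizer is -(g.d) / (2 d.d).
gd = sum(gi * di for gi, di in zip(g, d))
dd = sum(di * di for di in d)
alpha2 = min(alpha1, -gd / (2 * dd))

x_new = [xi + alpha2 * di for xi, di in zip(x, d)]
print(d, alpha1, alpha2, x_new)
```

The minimizing step is shorter than the largest feasible step, so no new constraint becomes active and the working set is unchanged at the new point.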


Nonlinear Constraints 


In extending the gradient projection method to problems of the form 


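$$\begin{aligned}
\text{minimize}\quad & f(\mathbf{x}) \\
\text{subject to}\quad & \mathbf{h}(\mathbf{x}) = \mathbf{0} \\
& \mathbf{g}(\mathbf{x}) \le \mathbf{0},
\end{aligned} \tag{24}$$


the basic idea is that at a feasible point $\mathbf{x}_k$ one determines the active constraints and 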
projects the negative gradient onto the subspace tangent to the surface determined 
by these constraints. This vector, if it is nonzero, determines the direction for the 
next step. The vector itself, however, is not in general a feasible direction, since the 
surface may be curved as illustrated in Fig. 12.6. It is therefore not always possible 
to move along this projected negative gradient to obtain the next point. 

What is typically done in the face of this difficulty is essentially to search 
along a curve on the constraint surface, the direction of the curve being defined by 
the projected negative gradient. A new point is found in the following way: First, 
a move is made along the projected negative gradient to a point y. Then a move 
is made in the direction perpendicular to the tangent plane at the original point to 
a nearby feasible point on the working surface, as illustrated in Fig. 12.6. Once 
this point is found the value of the objective is determined. This is repeated with 
various $\mathbf{y}$'s until a feasible point is found that satisfies one of the standard descent 
criteria for improvement relative to the original point. 

Fig. 12.6 Gradient projection method 

This procedure of tentatively moving away from the feasible region and then 
coming back introduces a number of additional difficulties that require a series of 
interpolations and nonlinear equation solutions for their resolution. A satisfactory 
general routine implementing the gradient projection philosophy is therefore of 
necessity quite complex. It is not our purpose here to elaborate on these details but 
simply to point out the general nature of the difficulties and the basic devices for 
surmounting them. 

One difficulty is illustrated in Fig. 12.7. If, after moving along the projected 
negative gradient to a point y, one attempts to return to a point that satisfies the 
old active constraints, some inequalities that were originally satisfied may then be 
violated. One must in this circumstance use an interpolation scheme to find a new 
point y along the negative gradient so that when returning to the active constraints 
no originally nonactive constraint is violated. Finding an appropriate y is to some 
extent a trial and error process. Finally, the job of returning to the active constraints 
is itself a nonlinear problem which must be solved with an iterative technique. Such 
a technique is described below, but within a finite number of iterations, it cannot 
exactly reach the surface. Thus typically an error tolerance $\delta$ is introduced, and 
throughout the procedure the constraints are satisfied only to within $\delta$. 

Computation of the projections is also more difficult in the nonlinear case. 
Lumping, for notational convenience, the active inequalities together with the 
equalities into $\mathbf{h}(\mathbf{x}_k)$, the projection matrix at $\mathbf{x}_k$ is 

$$\mathbf{P}_k = \mathbf{I} - \nabla\mathbf{h}(\mathbf{x}_k)^T[\nabla\mathbf{h}(\mathbf{x}_k)\nabla\mathbf{h}(\mathbf{x}_k)^T]^{-1}\nabla\mathbf{h}(\mathbf{x}_k). \tag{25}$$

At the point $\mathbf{x}_k$ this matrix can be updated to account for one more or one less 
constraint, just as in the linear case. When moving from $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$, however, $\nabla\mathbf{h}$ 
will change and the new projection matrix cannot be found from the old, and hence 
this matrix must be recomputed at each step. 


Fig. 12.7 Interpolation to obtain feasible point 

The most important new feature of the method is the problem of returning to 
the feasible region from points outside this region. The type of iterative technique 
employed is a common one in nonlinear programming, including interior-point 
methods of linear programming, and we describe it here. The idea is, from any 
point near $\mathbf{x}_k$, to move back to the constraint surface in a direction orthogonal 
to the tangent plane at $\mathbf{x}_k$. Thus from a point $\mathbf{y}$ we seek a point of the form 
$\mathbf{y} + \nabla\mathbf{h}(\mathbf{x}_k)^T\boldsymbol{\alpha} = \mathbf{y}^*$ such that $\mathbf{h}(\mathbf{y}^*) = \mathbf{0}$. As shown in Fig. 12.8 such a solution may 
not always exist, but it does for $\mathbf{y}$ sufficiently close to $\mathbf{x}_k$. 

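To find a suitable first approximation to $\boldsymbol{\alpha}$, and hence to $\mathbf{y}^*$, we linearize the 
equation at $\mathbf{x}_k$ obtaining 

$$\mathbf{h}(\mathbf{y} + \nabla\mathbf{h}(\mathbf{x}_k)^T\boldsymbol{\alpha}) \simeq \mathbf{h}(\mathbf{y}) + \nabla\mathbf{h}(\mathbf{x}_k)\nabla\mathbf{h}(\mathbf{x}_k)^T\boldsymbol{\alpha}, \tag{26}$$

the approximation being accurate for $|\boldsymbol{\alpha}|$ and $|\mathbf{y} - \mathbf{x}_k|$ small. This motivates the first 
approximation 

$$\boldsymbol{\alpha}_1 = -[\nabla\mathbf{h}(\mathbf{x}_k)\nabla\mathbf{h}(\mathbf{x}_k)^T]^{-1}\mathbf{h}(\mathbf{y}) \tag{27}$$

$$\mathbf{y}_1 = \mathbf{y} - \nabla\mathbf{h}(\mathbf{x}_k)^T[\nabla\mathbf{h}(\mathbf{x}_k)\nabla\mathbf{h}(\mathbf{x}_k)^T]^{-1}\mathbf{h}(\mathbf{y}). \tag{28}$$

Substituting $\mathbf{y}_1$ for $\mathbf{y}$ and successively repeating the process yields the sequence 
$\{\mathbf{y}_j\}$ generated by 

$$\mathbf{y}_{j+1} = \mathbf{y}_j - \nabla\mathbf{h}(\mathbf{x}_k)^T[\nabla\mathbf{h}(\mathbf{x}_k)\nabla\mathbf{h}(\mathbf{x}_k)^T]^{-1}\mathbf{h}(\mathbf{y}_j), \tag{29}$$

which, started close enough to $\mathbf{x}_k$ and the constraint surface, will converge to a 
solution $\mathbf{y}^*$. We note that this process requires the same matrices as the projection 
operation. 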
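
The iteration (29) is easy to try on a small case. The sketch below, in Python, assumes a single constraint (the unit circle), so that the matrix to be inverted is a scalar. The function names, the starting points, and the tolerance are illustrative choices rather than anything prescribed by the text. Note that the gradient is evaluated once, at $\mathbf{x}_k$, and held fixed throughout.

```python
def h(y):
    """Single constraint: the unit circle, h(y) = y1^2 + y2^2 - 1."""
    return y[0] * y[0] + y[1] * y[1] - 1.0


def grad_h(x):
    """Gradient of h as a row vector."""
    return [2.0 * x[0], 2.0 * x[1]]


def restore(y, x_k, tol=1e-12, max_iter=100):
    """Iterate (29) with the gradient frozen at x_k; return None on failure."""
    n = grad_h(x_k)                      # grad h(x_k)
    nn = n[0] * n[0] + n[1] * n[1]       # grad h grad h^T, a scalar since m = 1
    for _ in range(max_iter):
        r = h(y)
        if abs(r) <= tol:                # constraint satisfied to within tol
            return y
        y = [y[0] - n[0] * r / nn, y[1] - n[1] * r / nn]
    return None


x_k = [1.0, 0.0]          # feasible point on the circle
y0 = [1.0, 0.3]           # tentative point after a move along the tangent
y_star = restore(y0, x_k)
print(y_star, h(y_star))
```

Every correction is a multiple of the fixed vector $\nabla h(\mathbf{x}_k)^T$, so the second coordinate never changes and the iterates approach the circle along a line orthogonal to the tangent at $\mathbf{x}_k$. The test on the residual plays the role of the error tolerance mentioned earlier.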

The gradient projection method has been successfully implemented and has 
been found to be effective in solving general nonlinear programming problems. 
Successful implementation resolving the several difficulties introduced by the 
requirement of staying in the feasible region requires, as one would expect, some 
degree of skill. The true value of the method, however, can be determined only 
through an analysis of its rate of convergence. 


Fig. 12.8 Case in which it is impossible to return to surface 


12.5 CONVERGENCE RATE OF THE GRADIENT 
PROJECTION METHOD 


An analysis that directly attacked the nonlinear version of the gradient projection 
method, with all of its iterative and interpolative devices, would quickly become 
monstrous. To obtain the asymptotic rate of convergence, however, it is not 
necessary to analyze this complex algorithm directly—instead it is sufficient to 
analyze an alternate simplified algorithm that asymptotically duplicates the gradient 
projection method near the solution. Through the introduction of this idealized 
algorithm we show that the rate of convergence of the gradient projection method 
is governed by the eigenvalue structure of the Hessian of the Lagrangian restricted 
to the constraint tangent subspace. 


Geodesic Descent 


For simplicity we consider first the problem having only equality constraints 


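$$\begin{aligned}
\text{minimize}\quad & f(\mathbf{x}) \\
\text{subject to}\quad & \mathbf{h}(\mathbf{x}) = \mathbf{0}.
\end{aligned} \tag{30}$$


The constraints define a continuous surface $\Omega$ in $E^n$. 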

In considering our own difficulties with this problem, owing to the fact that the 
surface is nonlinear thereby making directions of descent difficult to define, it is 
well to also consider the problem as it would be viewed by a small bug confined to 
the constraint surface who imagines it to be his total universe. To him the problem 
seems to be a simple one. It is unconstrained, with respect to his universe, and is 
only $(n - m)$-dimensional. He would characterize a solution point as a point where 
the gradient of $f$ (as measured on the surface) vanishes and where the appropriate 
$(n - m)$-dimensional Hessian of $f$ is positive semidefinite. If asked to develop a 
computational procedure for this problem, he would undoubtedly suggest, since 
he views the problem as unconstrained, the method of steepest descent. He would 
compute the gradient, as measured on his surface, and would move along what 
would appear to him to be straight lines. 

Exactly what the bug would compute as the gradient and exactly what he 
would consider as straight lines would depend basically on how distance between 
two points on his surface were measured. If, as is most natural, we assume that he 
inherits his notion of distance from the one which we are using in $E^n$, then the path 
$\mathbf{x}(t)$ between two points $\mathbf{x}_1$ and $\mathbf{x}_2$ on his surface that minimizes $\int_{\mathbf{x}_1}^{\mathbf{x}_2} |\dot{\mathbf{x}}(t)|\,dt$ would 
be considered a straight line by him. Such a curve, having minimum arc length 
between two given points, is called a geodesic. 

Returning to our own view of the problem, we note, as we have previously, that 
if we project the negative gradient onto the tangent plane of the constraint surface 
at a point $\mathbf{x}_k$, we cannot move along this projection itself and remain feasible. We 
might, however, consider moving along a curve which had the same initial heading 
as the projected negative gradient but which remained on the surface. Exactly which 
such curve to move along is somewhat arbitrary, but a natural choice, inspired 
perhaps by the considerations of the bug, is a geodesic. Specifically, at a given 
point on the surface, we would determine the geodesic curve passing through that 
point that had an initial heading identical to that of the projected negative gradient. 
We would then move along this geodesic to a new point on the surface having a 
lesser value of $f$. 

The idealized procedure then, which the bug would use without a second 
thought, and which we would use if it were computationally feasible (which it 
definitely is not), would at a given feasible point $\mathbf{x}_k$ (see Fig. 12.9): 


1. Calculate the projection $\mathbf{p}$ of $-\nabla f(\mathbf{x}_k)^T$ onto the tangent plane at $\mathbf{x}_k$. 

2. Find the geodesic, $\mathbf{x}(t)$, $t \ge 0$, of the constraint surface having $\mathbf{x}(0) = \mathbf{x}_k$, $\dot{\mathbf{x}}(0) = \mathbf{p}$. 

3. Minimize $f(\mathbf{x}(t))$ with respect to $t \ge 0$, obtaining $t_k$ and $\mathbf{x}_{k+1} = \mathbf{x}(t_k)$. 


At this point we emphasize that this technique (which we refer to as geodesic 
descent) is proposed essentially for theoretical purposes only. It does, however, 
capture the main philosophy of the gradient projection method. Furthermore, as 
the step size of the methods goes to zero, as it does near the solution point, the 
distance between the point that would be determined by the gradient projection 
method and the point found by the idealized method goes to zero even faster. Thus 
the asymptotic rates of convergence for the two methods will be equal, and it is, 
therefore, appropriate to concentrate on the idealized method only. 

Our bug confined to the surface would have no hesitation in estimating the rate 
of convergence of this method. He would simply express it in terms of the smallest 
and largest eigenvalues of the Hessian of f as measured on his surface. It should 
not be surprising, then, that we show that the asymptotic convergence ratio is 

$$\left(\frac{A - a}{A + a}\right)^2, \tag{31}$$

where $a$ and $A$ are, respectively, the smallest and largest eigenvalues of $\mathbf{L}$, the 
Hessian of the Lagrangian, restricted to the tangent subspace $M$. This result parallels 
the convergence rate of the method of steepest descent, but with the eigenvalues 
determined from the same restricted Hessian matrix that is important in the general 
theory of necessary and sufficient conditions for constrained problems. This rate, 
which almost invariably arises when studying algorithms designed for constrained 
problems, will be referred to as the canonical rate. 

Fig. 12.9 Geodesic descent 

We emphasize again that, since this convergence ratio governs the convergence 
of a large family of algorithms, it is the formula itself rather than its numerical 
value that is important. For any given problem we do not suggest that this ratio be 
evaluated, since this would be extremely difficult. Instead, the potency of the result 
derives from the fact that fairly comprehensive comparisons among algorithms can 
be made, on the basis of this formula, that apply to general classes of problems 
rather than simply to particular problems. 

The remainder of this section is devoted to the analysis that is required to 
establish the convergence rate. Since this analysis is somewhat involved and not 
crucial for an understanding of remaining material, some readers may wish to 
simply read the theorem statement and proceed to the next section. 


Geodesics 


Given the surface $\Omega = \{\mathbf{x} : \mathbf{h}(\mathbf{x}) = \mathbf{0}\} \subset E^n$, a smooth curve $\mathbf{x}(t) \in \Omega$, $0 \le t \le T$, 
starting at $\mathbf{x}(0)$ and terminating at $\mathbf{x}(T)$ that minimizes the total arc length 

$$\int_0^T |\dot{\mathbf{x}}(t)|\,dt$$

with respect to all other such curves on $\Omega$ is said to be a geodesic connecting $\mathbf{x}(0)$ 
and $\mathbf{x}(T)$. 

It is common to parameterize a geodesic $\mathbf{x}(t)$, $0 \le t \le T$, so that $|\dot{\mathbf{x}}(t)| = 1$. 
The parameter $t$ is then itself the arc length. If the parameter $t$ is also regarded as 
time, then this parameterization corresponds to moving along the geodesic curve 
with unit velocity. Parameterized in this way, the geodesic is said to be normalized. 
On any linear subspace of $E^n$ geodesics are straight lines. On a three-dimensional 
sphere, the geodesics are arcs of great circles. 

It can be shown, using the calculus of variations, that any normalized geodesic 
on $\Omega$ satisfies the condition 

$$\ddot{\mathbf{x}}(t) = \nabla\mathbf{h}(\mathbf{x}(t))^T\boldsymbol{\omega}(t) \tag{32}$$

for some function $\boldsymbol{\omega}$ taking values in $E^m$. Geometrically, this condition says that 
if one moves along the geodesic curve with unit velocity, the acceleration at every 
point will be orthogonal to the surface. Indeed, this property can be regarded as 
the fundamental defining characteristic of a geodesic. To stay on the surface $\Omega$, the 
geodesic must also satisfy the equation 

$$\nabla\mathbf{h}(\mathbf{x}(t))\dot{\mathbf{x}}(t) = \mathbf{0}, \tag{33}$$

since the velocity vector at every point is tangent to $\Omega$. At a regular point $\mathbf{x}_0$ 
these two differential equations, together with the initial conditions $\mathbf{x}(0) = \mathbf{x}_0$, $\dot{\mathbf{x}}(0)$ 
specified, and $|\dot{\mathbf{x}}(0)| = 1$, uniquely specify a curve $\mathbf{x}(t)$, $t \ge 0$, that can be continued 
as long as points on the curve are regular. Furthermore, $|\dot{\mathbf{x}}(t)| = 1$ for $t \ge 0$. Hence 
geodesic curves emanate in every direction from a regular point. Thus, for example, 
at any point on a sphere there is a unique great circle passing through the point in 
a given direction. 
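
The two conditions (32) and (33) can be checked directly for the sphere example. The short Python sketch below assumes the unit sphere written as $h(\mathbf{x}) = \mathbf{x}^T\mathbf{x} - 1 = 0$ and one particular great circle traversed at unit speed; the function names and sample times are illustrative. For this curve the multiplier in (32) is the constant $\omega = -1/2$.

```python
import math


def position(t):
    """Unit-speed great circle on the unit sphere h(x) = x.x - 1 = 0."""
    return [math.cos(t), math.sin(t), 0.0]


def velocity(t):
    return [-math.sin(t), math.cos(t), 0.0]


def acceleration(t):
    return [-math.cos(t), -math.sin(t), 0.0]


def grad_h(p):
    return [2.0 * c for c in p]


omega = -0.5                                   # multiplier in (32) for this curve
checks = []
for t in [0.0, 0.4, 1.3, 2.9]:
    n = grad_h(position(t))
    v, acc = velocity(t), acceleration(t)
    res32 = max(abs(ai - omega * ni) for ai, ni in zip(acc, n))   # (32): acc = grad h^T omega
    res33 = abs(sum(ni * vi for ni, vi in zip(n, v)))             # (33): grad h . velocity = 0
    speed = math.sqrt(sum(vi * vi for vi in v))
    checks.append((res32, res33, speed))
print(checks)
```

The acceleration points along the surface normal at every sample time, the velocity stays tangent, and the speed stays at one, which is what the normalized geodesic conditions require.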


Lagrangian and Geodesics 


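Corresponding to any regular point $\mathbf{x} \in \Omega$, we may define a corresponding Lagrange 
multiplier $\boldsymbol{\lambda}(\mathbf{x})$ by calculating the projection of the gradient of $f$ onto the tangent 
subspace at $\mathbf{x}$, denoted $M(\mathbf{x})$. The matrix that, when operating on a vector, projects 
it onto $M(\mathbf{x})$ is 

$$\mathbf{P}(\mathbf{x}) = \mathbf{I} - \nabla\mathbf{h}(\mathbf{x})^T[\nabla\mathbf{h}(\mathbf{x})\nabla\mathbf{h}(\mathbf{x})^T]^{-1}\nabla\mathbf{h}(\mathbf{x}),$$

and it follows immediately that the projection of $\nabla f(\mathbf{x})^T$ onto $M(\mathbf{x})$ has the form 

$$\mathbf{y}(\mathbf{x}) = [\nabla f(\mathbf{x}) + \boldsymbol{\lambda}(\mathbf{x})^T\nabla\mathbf{h}(\mathbf{x})]^T, \tag{34}$$

where $\boldsymbol{\lambda}(\mathbf{x})$ is given explicitly as 

$$\boldsymbol{\lambda}(\mathbf{x})^T = -\nabla f(\mathbf{x})\nabla\mathbf{h}(\mathbf{x})^T[\nabla\mathbf{h}(\mathbf{x})\nabla\mathbf{h}(\mathbf{x})^T]^{-1}. \tag{35}$$

Thus, in terms of the Lagrangian function $l(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \boldsymbol{\lambda}^T\mathbf{h}(\mathbf{x})$, the projected 
gradient is 

$$\mathbf{y}(\mathbf{x}) = l_{\mathbf{x}}(\mathbf{x}, \boldsymbol{\lambda}(\mathbf{x}))^T. \tag{36}$$

If a local solution to the original problem occurs at a regular point $\mathbf{x}^* \in \Omega$, then 
as we know 

$$l_{\mathbf{x}}(\mathbf{x}^*, \boldsymbol{\lambda}(\mathbf{x}^*)) = \mathbf{0}, \tag{37}$$

which states that the projected gradient must vanish at $\mathbf{x}^*$. Defining $\mathbf{L}(\mathbf{x}) = 
l_{\mathbf{xx}}(\mathbf{x}, \boldsymbol{\lambda}(\mathbf{x})) = \mathbf{F}(\mathbf{x}) + \boldsymbol{\lambda}(\mathbf{x})^T\mathbf{H}(\mathbf{x})$ we also know that at $\mathbf{x}^*$ we have the 
second-order necessary condition that $\mathbf{L}(\mathbf{x}^*)$ is positive semidefinite on $M(\mathbf{x}^*)$; that is, 
$\mathbf{z}^T\mathbf{L}(\mathbf{x}^*)\mathbf{z} \ge 0$ for all $\mathbf{z} \in M(\mathbf{x}^*)$. Equivalently, letting 

$$\bar{\mathbf{L}}(\mathbf{x}) = \mathbf{P}(\mathbf{x})\mathbf{L}(\mathbf{x})\mathbf{P}(\mathbf{x}), \tag{38}$$

it follows that $\bar{\mathbf{L}}(\mathbf{x}^*)$ is positive semidefinite. 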


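We then have the following fundamental and simple result, valid along a 
geodesic. 

Proposition 1. Let $\mathbf{x}(t)$, $0 \le t \le T$, be a geodesic on $\Omega$. Then 

$$\frac{d}{dt} f(\mathbf{x}(t)) = l_{\mathbf{x}}(\mathbf{x}, \boldsymbol{\lambda}(\mathbf{x}))\dot{\mathbf{x}}(t) \tag{39}$$

$$\frac{d^2}{dt^2} f(\mathbf{x}(t)) = \dot{\mathbf{x}}(t)^T\mathbf{L}(\mathbf{x}(t))\dot{\mathbf{x}}(t). \tag{40}$$

Proof. We have 

$$\frac{d}{dt} f(\mathbf{x}(t)) = \nabla f(\mathbf{x}(t))\dot{\mathbf{x}}(t) = l_{\mathbf{x}}(\mathbf{x}, \boldsymbol{\lambda}(\mathbf{x}))\dot{\mathbf{x}}(t),$$

the second equality following from the fact that $\dot{\mathbf{x}}(t) \in M(\mathbf{x})$. Next, 

$$\frac{d^2}{dt^2} f(\mathbf{x}(t)) = \dot{\mathbf{x}}(t)^T\mathbf{F}(\mathbf{x}(t))\dot{\mathbf{x}}(t) + \nabla f(\mathbf{x}(t))\ddot{\mathbf{x}}(t). \tag{41}$$

But differentiating the relation $\boldsymbol{\lambda}^T\mathbf{h}(\mathbf{x}(t)) = 0$ twice, for fixed $\boldsymbol{\lambda}$, yields 

$$\dot{\mathbf{x}}(t)^T\boldsymbol{\lambda}^T\mathbf{H}(\mathbf{x}(t))\dot{\mathbf{x}}(t) + \boldsymbol{\lambda}^T\nabla\mathbf{h}(\mathbf{x}(t))\ddot{\mathbf{x}}(t) = 0.$$

Adding this to (41), we have 

$$\frac{d^2}{dt^2} f(\mathbf{x}(t)) = \dot{\mathbf{x}}(t)^T(\mathbf{F} + \boldsymbol{\lambda}^T\mathbf{H})\dot{\mathbf{x}}(t) + (\nabla f(\mathbf{x}) + \boldsymbol{\lambda}^T\nabla\mathbf{h}(\mathbf{x}))\ddot{\mathbf{x}}(t),$$

which is true for any fixed $\boldsymbol{\lambda}$. Setting $\boldsymbol{\lambda} = \boldsymbol{\lambda}(\mathbf{x})$ determined as above, $(\nabla f + \boldsymbol{\lambda}^T\nabla\mathbf{h})^T$ 
is in $M(\mathbf{x})$ and hence orthogonal to $\ddot{\mathbf{x}}(t)$, since $\mathbf{x}(t)$ is a normalized geodesic. This 
gives (40). ∎ 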


It should be noted that we proved a simplified version of this result in 
Chapter 11. There the result was given only for the optimal point x*, although it 
was valid for any curve. Here we have shown that essentially the same result is 
valid at any point provided that we move along a geodesic. 


Rate of Convergence 


We now prove the main theorem regarding the rate of convergence. We assume 
that all functions are three times continuously differentiable and that every point 
in a region near the solution $\mathbf{x}^*$ is regular. This theorem establishes only the rate 
of convergence and not convergence itself; for that reason the stated hypotheses 
assume that the method of geodesic descent generates a sequence $\{\mathbf{x}_k\}$ converging 
to $\mathbf{x}^*$. 


Theorem. Let $\mathbf{x}^*$ be a local solution to the problem (30) and suppose that 
$A$ and $a > 0$ are, respectively, the largest and smallest eigenvalues of $\mathbf{L}(\mathbf{x}^*)$ 
restricted to the tangent subspace $M(\mathbf{x}^*)$. If $\{\mathbf{x}_k\}$ is a sequence generated by 
the method of geodesic descent that converges to $\mathbf{x}^*$, then the sequence of 
objective values $\{f(\mathbf{x}_k)\}$ converges to $f(\mathbf{x}^*)$ linearly with a ratio no greater 
than $[(A - a)/(A + a)]^2$. 


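Proof. Without loss of generality we may assume $f(\mathbf{x}^*) = 0$. Given a point $\mathbf{x}_k$ it 
will be convenient to define its distance from the solution point $\mathbf{x}^*$ as the arc length 
of the geodesic connecting $\mathbf{x}^*$ and $\mathbf{x}_k$. Thus if $\mathbf{x}(t)$ is a parameterized version of the 
geodesic with $\mathbf{x}(0) = \mathbf{x}^*$, $|\dot{\mathbf{x}}(t)| = 1$, $\mathbf{x}(T) = \mathbf{x}_k$, then $T$ is the distance of $\mathbf{x}_k$ from 
$\mathbf{x}^*$. Associated with such a geodesic we also have the family $\mathbf{y}(t)$, $0 \le t \le T$, of 
corresponding projected gradients $\mathbf{y}(t) = l_{\mathbf{x}}(\mathbf{x}, \boldsymbol{\lambda}(\mathbf{x}))^T$, and Hessians $\mathbf{L}(t) = \mathbf{L}(\mathbf{x}(t))$. 
We write $\mathbf{y}_k = \mathbf{y}(\mathbf{x}_k)$, $\mathbf{L}_k = \mathbf{L}(\mathbf{x}_k)$. 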

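We now derive an estimate for $f(\mathbf{x}_k)$. Using the geodesic discussed above we 
can write (setting $\dot{\mathbf{x}}_k = \dot{\mathbf{x}}(T)$) 

$$f(\mathbf{x}^*) - f(\mathbf{x}_k) = -f(\mathbf{x}_k) = -\mathbf{y}_k^T\dot{\mathbf{x}}_k T + \tfrac{1}{2}T^2\dot{\mathbf{x}}_k^T\mathbf{L}_k\dot{\mathbf{x}}_k + o(T^2), \tag{42}$$

which follows from Proposition 1. We also have 

$$\mathbf{y}_k = -\mathbf{y}(\mathbf{x}^*) + \mathbf{y}(\mathbf{x}_k) = \dot{\mathbf{y}}_k T + o(T). \tag{43}$$

But differentiating (34) we obtain 

$$\dot{\mathbf{y}}_k = \mathbf{L}_k\dot{\mathbf{x}}_k + \nabla\mathbf{h}(\mathbf{x}_k)^T\dot{\boldsymbol{\lambda}}_k \tag{44}$$

and hence if $\mathbf{P}_k$ is the projection matrix onto $M(\mathbf{x}_k) = M_k$, we have 

$$\mathbf{P}_k\dot{\mathbf{y}}_k = \mathbf{P}_k\mathbf{L}_k\dot{\mathbf{x}}_k. \tag{45}$$

Multiplying (43) by $\mathbf{P}_k$ and accounting for $\mathbf{P}_k\mathbf{y}_k = \mathbf{y}_k$ we have 

$$\mathbf{P}_k\dot{\mathbf{y}}_k T = \mathbf{y}_k + o(T). \tag{46}$$

Substituting (45) into this we obtain 

$$\mathbf{P}_k\mathbf{L}_k\dot{\mathbf{x}}_k T = \mathbf{y}_k + o(T).$$

Since $\mathbf{P}_k\dot{\mathbf{x}}_k = \dot{\mathbf{x}}_k$ we have, defining $\bar{\mathbf{L}}_k = \mathbf{P}_k\mathbf{L}_k\mathbf{P}_k$, 

$$\bar{\mathbf{L}}_k\dot{\mathbf{x}}_k T = \mathbf{y}_k + o(T). \tag{47}$$

The matrix $\bar{\mathbf{L}}_k$ is related to $\mathbf{L}_{M_k}$, the restriction of $\mathbf{L}_k$ to $M_k$, the only difference 
being that while $\mathbf{L}_{M_k}$ is defined only on $M_k$, the matrix $\bar{\mathbf{L}}_k$ is defined on all of $E^n$ 
but in such a way that it agrees with $\mathbf{L}_{M_k}$ on $M_k$ and is zero on $M_k^{\perp}$. The matrix $\bar{\mathbf{L}}_k$ 
is not invertible, but for $\mathbf{y}_k \in M_k$ there is a unique solution $\mathbf{z} \in M_k$ to the equation 
$\bar{\mathbf{L}}_k\mathbf{z} = \mathbf{y}_k$, which we denote† $\bar{\mathbf{L}}_k^{-1}\mathbf{y}_k$. With this notation we obtain from (47) 

$$\dot{\mathbf{x}}_k T = \bar{\mathbf{L}}_k^{-1}\mathbf{y}_k + o(T). \tag{48}$$

Substituting this last result into (42) and accounting for $|\mathbf{y}_k| = O(T)$ (see (43)) we 
have 

$$f(\mathbf{x}_k) = \tfrac{1}{2}\mathbf{y}_k^T\bar{\mathbf{L}}_k^{-1}\mathbf{y}_k + o(T^2), \tag{49}$$

which expresses the objective value at $\mathbf{x}_k$ in terms of the projected gradient. 
Since $|\dot{\mathbf{x}}_k| = 1$ and since $\bar{\mathbf{L}}_k \to \bar{\mathbf{L}}^*$ as $\mathbf{x}_k \to \mathbf{x}^*$, we see from (47) that 

$$o(T) + aT \le |\mathbf{y}_k| \le AT + o(T), \tag{50}$$

which means that not only do we have $|\mathbf{y}_k| = O(T)$, which was known before, but 
also $|\mathbf{y}_k| \ne o(T)$. We may therefore write our estimate (49) in the alternate form 

$$f(\mathbf{x}_k) = \tfrac{1}{2}\mathbf{y}_k^T\bar{\mathbf{L}}_k^{-1}\mathbf{y}_k\left(1 + \frac{o(T^2)}{\mathbf{y}_k^T\bar{\mathbf{L}}_k^{-1}\mathbf{y}_k}\right), \tag{51}$$

and since, under the assumed three-times differentiability, the $o(T^2)$ remainder is in 
fact of order $T^3$, so that $o(T^2)/\mathbf{y}_k^T\bar{\mathbf{L}}_k^{-1}\mathbf{y}_k = O(T)$, we have 

$$f(\mathbf{x}_k) = \tfrac{1}{2}\mathbf{y}_k^T\bar{\mathbf{L}}_k^{-1}\mathbf{y}_k\,(1 + O(T)), \tag{52}$$

which is the desired estimate. 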

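Next, we estimate $f(\mathbf{x}_{k+1})$ in terms of $f(\mathbf{x}_k)$. Given $\mathbf{x}_k$ now let $\mathbf{x}(t)$, $t \ge 0$, be 
the normalized geodesic emanating from $\mathbf{x}_k \equiv \mathbf{x}(0)$ in the direction of the negative 
projected gradient, that is, 

$$\dot{\mathbf{x}}(0) \equiv \dot{\mathbf{x}}_k = -\mathbf{y}_k/|\mathbf{y}_k|.$$

Then 

$$f(\mathbf{x}(t)) = f(\mathbf{x}_k) + t\,\mathbf{y}_k^T\dot{\mathbf{x}}_k + \frac{t^2}{2}\dot{\mathbf{x}}_k^T\mathbf{L}_k\dot{\mathbf{x}}_k + o(t^2). \tag{53}$$

This is minimized at 

$$t_k = -\frac{\mathbf{y}_k^T\dot{\mathbf{x}}_k}{\dot{\mathbf{x}}_k^T\mathbf{L}_k\dot{\mathbf{x}}_k} + o(t_k). \tag{54}$$

In view of (50) this implies that $t_k = O(T)$, $t_k \ne o(T)$. Thus $t_k$ goes to zero at 
essentially the same rate as $T$. Thus we have 

$$f(\mathbf{x}_{k+1}) = f(\mathbf{x}_k) - \frac{1}{2}\frac{(\mathbf{y}_k^T\dot{\mathbf{x}}_k)^2}{\dot{\mathbf{x}}_k^T\mathbf{L}_k\dot{\mathbf{x}}_k} + o(T^2). \tag{55}$$

Using the same argument as before we can express this as 

$$f(\mathbf{x}_k) - f(\mathbf{x}_{k+1}) = \frac{1}{2}\frac{(\mathbf{y}_k^T\mathbf{y}_k)^2}{\mathbf{y}_k^T\bar{\mathbf{L}}_k\mathbf{y}_k}\,(1 + O(T)), \tag{56}$$

which is the other required estimate. 
Finally, dividing (56) by (52) we find 

$$\frac{f(\mathbf{x}_k) - f(\mathbf{x}_{k+1})}{f(\mathbf{x}_k)} = \frac{(\mathbf{y}_k^T\mathbf{y}_k)^2\,(1 + O(T))}{(\mathbf{y}_k^T\bar{\mathbf{L}}_k\mathbf{y}_k)(\mathbf{y}_k^T\bar{\mathbf{L}}_k^{-1}\mathbf{y}_k)} \tag{57}$$

and thus 

$$f(\mathbf{x}_{k+1}) = \left[1 - \frac{(\mathbf{y}_k^T\mathbf{y}_k)^2\,(1 + O(T))}{(\mathbf{y}_k^T\bar{\mathbf{L}}_k\mathbf{y}_k)(\mathbf{y}_k^T\bar{\mathbf{L}}_k^{-1}\mathbf{y}_k)}\right] f(\mathbf{x}_k). \tag{58}$$

Using the fact that $\bar{\mathbf{L}}_k \to \bar{\mathbf{L}}^*$ and applying the Kantorovich inequality leads to 

$$f(\mathbf{x}_{k+1}) \le \left[\left(\frac{A - a}{A + a}\right)^2 + O(T)\right] f(\mathbf{x}_k). \tag{59}$$

† Actually a more standard procedure is to define the pseudoinverse $\bar{\mathbf{L}}_k^{\dagger}$, and then $\mathbf{z} = \bar{\mathbf{L}}_k^{\dagger}\mathbf{y}_k$. 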
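
The last step of the proof rests on the Kantorovich inequality, which bounds the quotient in (58) from below by $4aA/(a + A)^2$. The following Python sketch checks this numerically for a hypothetical diagonal matrix standing in for the restricted Hessian; the eigenvalues, the random sampling, and the function name are illustrative assumptions.

```python
import random


def kantorovich_ratio(eigs, y):
    """(y.y)^2 / ((y^T L y)(y^T L^{-1} y)) for L = diag(eigs)."""
    yy = sum(v * v for v in y)
    yLy = sum(e * v * v for e, v in zip(eigs, y))
    yLinvy = sum(v * v / e for e, v in zip(eigs, y))
    return yy * yy / (yLy * yLinvy)


eigs = [1.0, 2.0, 3.0]               # assumed eigenvalues on the tangent subspace
a, A = min(eigs), max(eigs)
lower = 4.0 * a * A / (a + A) ** 2   # Kantorovich lower bound on the ratio
rate = ((A - a) / (A + a)) ** 2      # canonical rate, equal to 1 - lower

rng = random.Random(0)
worst = 0.0
for _ in range(1000):
    y = [rng.uniform(-1.0, 1.0) for _ in eigs]
    worst = max(worst, 1.0 - kantorovich_ratio(eigs, y))

# A vector with equal weight on the extreme eigenvectors attains the bound.
extreme = 1.0 - kantorovich_ratio(eigs, [1.0, 0.0, 1.0])
print(rate, worst, extreme)
```

No sampled vector produces a reduction factor worse than the canonical rate, and the bound is attained when the projected gradient has equal components along the eigenvectors belonging to $a$ and $A$.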


Problems with Inequalities 


The idealized version of gradient projection could easily be extended to problems 
having nonlinear inequalities as well as equalities by following the pattern of 
Section 12.4. Such an extension, however, has no real value, since the idealized 
scheme cannot be implemented. The idealized procedure was devised only as a 
technique for analyzing the asymptotic rate of convergence of the analytically more 
complex, but more practical, gradient projection method. 

The analysis of the idealized version of gradient projection given above, 
nevertheless, does apply to problems having inequality as well as equality constraints. If 
a computationally feasible procedure is employed that avoids jamming and does not 
bounce on and off constraint boundaries an infinite number of times, then near the 
solution the active constraints will remain fixed. This means that near the solution 
the method acts just as if it were solving a problem having the active constraints 
as equality constraints. Thus the asymptotic rate of convergence of the gradient 
projection method applied to a problem with inequalities is also given by (59) but 
with $\mathbf{L}(\mathbf{x}^*)$ and $M(\mathbf{x}^*)$ (and hence $a$ and $A$) determined by the active constraints 
at the solution point $\mathbf{x}^*$. In every case, therefore, the rate of convergence is 
determined by the eigenvalues of the same restricted Hessian that arises in the necessary 
conditions. 


12.6 THE REDUCED GRADIENT METHOD 


From a computational viewpoint, the reduced gradient method, discussed in this 
section and the next, is closely related to the simplex method of linear programming 
in that the problem variables are partitioned into basic and nonbasic groups. From 
a theoretical viewpoint, the method can be shown to behave very much like the 
gradient projection method. 


Linear Constraints 


Consider the problem 


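$$\begin{aligned}
\text{minimize}\quad & f(\mathbf{x}) \\
\text{subject to}\quad & \mathbf{A}\mathbf{x} = \mathbf{b},\quad \mathbf{x} \ge \mathbf{0},
\end{aligned} \tag{60}$$


where $\mathbf{x} \in E^n$, $\mathbf{b} \in E^m$, $\mathbf{A}$ is $m \times n$, and $f$ is a function in $C^2$. The constraints are 
expressed in the format of the standard form of linear programming. For simplicity 
of notation it is assumed that each variable is required to be nonnegative; if 
some variables were free, the procedure (but not the notation) would be somewhat 
simplified. 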

We invoke the nondegeneracy assumptions that every collection of m columns 
from A is linearly independent and every basic solution to the constraints has m 
strictly positive variables. With these assumptions any feasible solution will have 
at most $n - m$ variables taking the value zero. Given a vector $\mathbf{x}$ satisfying the 
constraints, we partition the variables into two groups: $\mathbf{x} = (\mathbf{y}, \mathbf{z})$ where $\mathbf{y}$ has 
dimension $m$ and $\mathbf{z}$ has dimension $n - m$. This partition is formed in such a way 
that all variables in $\mathbf{y}$ are strictly positive (for simplicity of notation we indicate the 
basic variables as being the first $m$ components of $\mathbf{x}$ but, of course, in general this 
will not be so). With respect to the partition, the original problem can be expressed 
as 

$$\text{minimize}\quad f(\mathbf{y}, \mathbf{z}) \tag{61a}$$

$$\text{subject to}\quad \mathbf{B}\mathbf{y} + \mathbf{C}\mathbf{z} = \mathbf{b} \tag{61b}$$

$$\mathbf{y} \ge \mathbf{0},\quad \mathbf{z} \ge \mathbf{0}, \tag{61c}$$

where, of course, $\mathbf{A} = [\mathbf{B}, \mathbf{C}]$. We can regard $\mathbf{z}$ as consisting of the independent 
variables and $\mathbf{y}$ the dependent variables, since if $\mathbf{z}$ is specified, (61b) can be uniquely 
solved for $\mathbf{y}$. Furthermore, a small change $\Delta\mathbf{z}$ from the original value that leaves 
$\mathbf{z} + \Delta\mathbf{z}$ nonnegative will, upon solution of (61b), yield another feasible solution, 
since $\mathbf{y}$ was originally taken to be strictly positive and thus $\mathbf{y} + \Delta\mathbf{y}$ will also be 
positive for small $\Delta\mathbf{y}$. We may therefore move from one feasible solution to another 
by selecting a $\Delta\mathbf{z}$ and moving $\mathbf{z}$ on the line $\mathbf{z} + \alpha\Delta\mathbf{z}$, $\alpha \ge 0$. Accordingly, $\mathbf{y}$ will move 
along a corresponding line $\mathbf{y} + \alpha\Delta\mathbf{y}$. If in moving this way some variable becomes 
zero, a new inequality constraint becomes active. If some independent variable 
becomes zero, a new direction $\Delta\mathbf{z}$ must be chosen. If a dependent (basic) variable 
becomes zero, the partition must be modified. The zero-valued basic variable is 
declared independent and one of the strictly positive independent variables is made 
dependent. Operationally, this interchange will be associated with a pivot operation. 

The idea of the reduced gradient method is to consider, at each stage, 
the problem only in terms of the independent variables. Since the vector of 
dependent variables y is determined through the constraints (61b) from the vector 
of independent variables z, the objective function can be considered to be a function 
of z only. Hence a simple modification of steepest descent, accounting for the 
constraints, can be executed. The gradient with respect to the independent variables 
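$\mathbf{z}$ (the reduced gradient) is found by evaluating the gradient of $f(\mathbf{B}^{-1}\mathbf{b} - \mathbf{B}^{-1}\mathbf{C}\mathbf{z}, \mathbf{z})$. 
It is equal to 

$$\mathbf{r}^T = \nabla_{\mathbf{z}} f(\mathbf{y}, \mathbf{z}) - \nabla_{\mathbf{y}} f(\mathbf{y}, \mathbf{z})\mathbf{B}^{-1}\mathbf{C}. \tag{62}$$

It is easy to see that a point $(\mathbf{y}, \mathbf{z})$ satisfies the first-order necessary conditions for 
optimality if and only if 

$$r_i = 0 \quad\text{for all } z_i > 0$$

$$r_i \ge 0 \quad\text{for all } z_i = 0.$$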

In the active set form of the reduced gradient method the vector $\mathbf{z}$ is moved in 
the direction of the negative reduced gradient on the working surface. Thus at each step, a 
direction of the form 

$$\Delta z_i = \begin{cases} -r_i, & i \notin W(\mathbf{z}) \\ 0, & i \in W(\mathbf{z}) \end{cases}$$

is determined and a descent is made in this direction. The working set is augmented 
whenever a new variable reaches zero; if it is a basic variable, a new partition 
is also formed. If a point is found where $r_i = 0$ for all $i \notin W(\mathbf{z})$ (representing a 
vanishing reduced gradient on the working surface) but $r_j < 0$ for some $j \in W(\mathbf{z})$, 
then $j$ is deleted from $W(\mathbf{z})$ as in the standard active set strategy. 

It is possible to avoid the pure active set strategy by moving away from an 
active constraint whenever that would lead to an improvement, rather than waiting 
until an exact minimum on the working surface is found. Indeed, this type of 
procedure is often used in practice. One version progresses by moving the vector 
$\mathbf{z}$ in the direction of the overall negative reduced gradient, except that zero-valued 
components of $\mathbf{z}$ that would thereby become negative are held at zero. One step of 
the procedure is as follows: 

1. Let $\Delta z_i = \begin{cases} -r_i & \text{if } r_i < 0 \text{ or } z_i > 0 \\ 0 & \text{otherwise.} \end{cases}$ 

2. If $\Delta\mathbf{z}$ is zero, stop; the current point is a solution. Otherwise, find $\Delta\mathbf{y} = 
   -\mathbf{B}^{-1}\mathbf{C}\Delta\mathbf{z}$. 

3. Find $\alpha_1$, $\alpha_2$, $\alpha_3$ achieving, respectively, 

   $\max\,\{\alpha : \mathbf{y} + \alpha\Delta\mathbf{y} \ge \mathbf{0}\}$ 

   $\max\,\{\alpha : \mathbf{z} + \alpha\Delta\mathbf{z} \ge \mathbf{0}\}$ 

   $\min\,\{f(\mathbf{x} + \alpha\Delta\mathbf{x}) : 0 \le \alpha \le \alpha_1,\ 0 \le \alpha \le \alpha_2\}$ 

   Let $\mathbf{x} = \mathbf{x} + \alpha_3\Delta\mathbf{x}$. 

4. If $\alpha_3 < \alpha_1$, return to (1). Otherwise, declare the vanishing variable in the 
   dependent set independent and declare a strictly positive variable in the 
   independent set dependent. Update $\mathbf{B}$ and $\mathbf{C}$. 
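
The direction rule of step 1 and the ratio tests of step 3 are simple to state in code. The following Python sketch uses hypothetical values for the reduced gradient and the independent variables; the function names `delta_z` and `max_step` are illustrative.

```python
def delta_z(r, z):
    """Step 1: move along -r, holding zero-valued components that would go negative."""
    return [-ri if (ri < 0 or zi > 0) else 0.0 for ri, zi in zip(r, z)]


def max_step(v, dv):
    """Largest alpha with v + alpha*dv >= 0 (infinite if no component decreases)."""
    ratios = [-vi / di for vi, di in zip(v, dv) if di < 0]
    return min(ratios) if ratios else float("inf")


r = [-8.0, 3.0, 2.0]       # hypothetical reduced gradient
z = [1.0, 0.0, 5.0]        # hypothetical independent variables
dz = delta_z(r, z)
alpha2 = max_step(z, dz)
print(dz, alpha2)
```

The second component is held at zero because it sits on its bound and its reduced gradient is positive, while the third is allowed to decrease until it reaches zero. A zero direction vector is exactly the stopping test of step 2, since it means $r_i = 0$ wherever $z_i > 0$ and $r_i \ge 0$ wherever $z_i = 0$.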


Example. We consider the example presented in Section 12.4 where the projected 
negative gradient was computed: 

$$\begin{aligned}
\text{minimize}\quad & x_1^2 + x_2^2 + x_3^2 + x_4^2 - 2x_1 - 3x_4 \\
\text{subject to}\quad & 2x_1 + x_2 + x_3 + 4x_4 = 7 \\
& x_1 + x_2 + 2x_3 + x_4 = 6 \\
& x_i \ge 0,\quad i = 1, 2, 3, 4.
\end{aligned}$$

We are given the feasible point $\mathbf{x} = (2, 2, 1, 0)$. We may select any two of the 
strictly positive variables to be the basic variables. Suppose $\mathbf{y} = (x_1, x_2)$ is selected. 
In standard form the constraints are then 

$$\begin{aligned}
x_1 + 0 - x_3 + 3x_4 &= 1 \\
0 + x_2 + 3x_3 - 2x_4 &= 5 \\
x_i \ge 0,\quad i &= 1, 2, 3, 4.
\end{aligned}$$

The gradient at the current point is $\mathbf{g} = (2, 4, 2, -3)$. The corresponding reduced 
gradient (with respect to $\mathbf{z} = (x_3, x_4)$) is then found by pricing-out in the usual 
manner. The situation at the current point can then be summarized by the tableau 

Variable          x_1    x_2    x_3    x_4  |
Constraints         1      0     -1      3  |  1
                    0      1      3     -2  |  5
r^T                 0      0     -8     -1  |
Current value       2      2      1      0  |

Tableau for Example 

In this solution $x_3$ and $x_4$ would be increased together in a ratio of eight to one. 
As they increase, $x_1$ and $x_2$ would follow in such a way as to keep the constraints 
satisfied. Overall, in $E^4$, the implied direction of movement is thus 

$$\mathbf{d} = (5, -22, 8, 1).$$

If the reader carefully supplies the computational details not shown in the 
presentation of the example as worked here and in Section 12.4, he will undoubtedly 
develop a considerable appreciation for the relative simplicity of the reduced 
gradient method. 
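
The computational details of this example are short enough to reproduce. The Python sketch below uses exact fractions and takes $\mathbf{B}^{-1}\mathbf{C}$ directly from the standard-form constraints; the variable names are illustrative. It then carries out the step-length calculation of step 3, which the example does not show, using the fact that $f$ is quadratic with Hessian $2\mathbf{I}$.

```python
from fractions import Fraction

x = [Fraction(v) for v in (2, 2, 1, 0)]
g = [2 * x[0] - 2, 2 * x[1], 2 * x[2], 2 * x[3] - 3]   # gradient of f at x

# With y = (x1, x2) basic, the standard-form constraints give B^{-1}C directly.
BinvC = [[-1, 3],
         [3, -2]]
g_y, g_z = g[:2], g[2:]

# Reduced gradient (62): r^T = grad_z f - grad_y f B^{-1} C.
r = [g_z[j] - sum(g_y[i] * BinvC[i][j] for i in range(2)) for j in range(2)]

# Independent variables move along -r; dependent ones follow dy = -B^{-1}C dz.
dz = [-rj for rj in r]
dy = [-sum(BinvC[i][j] * dz[j] for j in range(2)) for i in range(2)]
d = dy + dz

# Step 3: ratio test on the basic variables, then the exact line search
# f(x + alpha*d) = f(x) + alpha*(g.d) + alpha^2*(d.d).
alpha1 = min(-yi / di for yi, di in zip(x[:2], dy) if di < 0)
gd = sum(gi * di for gi, di in zip(g, d))
dd = sum(di * di for di in d)
alpha3 = min(alpha1, -gd / (2 * dd))
print(r, d, alpha1, alpha3)
```

Neither independent variable decreases along this direction, so only the basic variables limit the step. The minimizing step is smaller than the largest feasible one, so no basic variable reaches zero and no pivot is needed after this step.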

It should be clear that the reduced gradient method can, as illustrated in the 
example above, be executed with the aid of a tableau. At each step the tableau 
of constraints is arranged so that an identity matrix appears over the m dependent 
variables, and thus the dependent variables can be easily calculated from the values 
of the independent variables. The reduced gradient at any step is calculated by 
evaluating the n-dimensional gradient and “pricing out” the dependent variables 
just as the reduced cost vector is calculated in linear programming. And when the 
partition of basic and non-basic variables must be changed, a simple pivot operation 
is all that is required. 


Global Convergence 


The perceptive reader will note that the direction-finding algorithm that results from the 
second form of the reduced gradient method is not closed, since slight movement 
away from the boundary of an inequality constraint can cause a sudden change in 
the direction of search. Thus one might suspect, and correctly so, that this method 
is subject to jamming. However, a trivial modification will yield a closed mapping, 
and hence global convergence. This is discussed in Exercise 19. 


Nonlinear Constraints 


The generalized reduced gradient method solves nonlinear programming problems 
in the standard form 


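$$\begin{aligned}
\text{minimize}\quad & f(\mathbf{x}) \\
\text{subject to}\quad & \mathbf{h}(\mathbf{x}) = \mathbf{0},\quad \mathbf{a} \le \mathbf{x} \le \mathbf{b},
\end{aligned}$$


where $\mathbf{h}(\mathbf{x})$ is of dimension $m$. A general nonlinear programming problem can 
always be expressed in this form by the introduction of slack variables, if required, 
and by allowing some components of $\mathbf{a}$ and $\mathbf{b}$ to take on the values $+\infty$ or $-\infty$, 
if necessary. 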

In a manner quite analogous to that of the case of linear constraints, we 
introduce a nondegeneracy assumption that, at each point x, hypothesizes the 
existence of a partition of x into x = (y, z) having the following properties: 


i) y is of dimension m, and z is of dimension n - m.
ii) If $a = (a_y, a_z)$ and $b = (b_y, b_z)$ are the corresponding partitions of a, b, then
$a_y < y < b_y$.
iii) The m × m matrix $\nabla_y h(y, z)$ is nonsingular at x = (y, z).

Again y and z are referred to as the vectors of dependent and independent
variables, respectively.
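The reduced gradient (with respect to z) is in this case

$$ r^T = \nabla_z f(y, z) + \lambda^T \nabla_z h(y, z), $$

where $\lambda$ satisfies

$$ \nabla_y f(y, z) + \lambda^T \nabla_y h(y, z) = 0. $$

Equivalently, we have

$$ r^T = \nabla_z f(y, z) - \nabla_y f(y, z)\,[\nabla_y h(y, z)]^{-1}\,\nabla_z h(y, z). \tag{63} $$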


The actual procedure is roughly the same as for linear constraints in that moves 
are taken by changing z in the direction of the negative reduced gradient (with 
components of z on their boundary held fixed if the movement would violate the 
bound). The difference here is that although z moves along a straight line as before, 
the vector of dependent variables y must move nonlinearly to continuously satisfy 
the equality constraints. Computationally, this is accomplished by first moving
linearly along the tangent to the surface defined by $z \to z + \Delta z$, $y \to y + \Delta y$ with
$\Delta y = -[\nabla_y h]^{-1} \nabla_z h\, \Delta z$. Then a correction procedure, much like that employed in
the gradient projection method, is used to return to the constraint surface and
the magnitude bounds on the dependent variables are checked for feasibility. As
with the gradient projection method, a feasibility tolerance must be introduced to
acknowledge the impossibility of returning exactly to the constraint surface. An
example corresponding to n = 3, m = 1, a = 0, b = $+\infty$ is shown in Fig. 12.10.

To return to the surface once a tentative move along the tangent is made, an
iterative scheme is employed. If the point $x_k$ was the point at the previous step,
then from any point x = (v, w) near $x_k$ one gets back to the constraint surface by
solving the nonlinear equation

$$ h(y, w) = 0 \tag{64} $$

for y (with w fixed). This is accomplished through the iterative process

$$ y_{j+1} = y_j - [\nabla_y h(x_k)]^{-1} h(y_j, w), \tag{65} $$

which, if started close enough to $x_k$, will produce $\{y_j\}$ with $y_j \to y$, solving (64).
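
The tangent move followed by the correction iteration (65) can be sketched in Python for the scalar case m = 1. The unit-circle constraint, the step size, the tolerance, and the function names below are assumptions chosen only for illustration. As in (65), the derivative with respect to y is evaluated once at the previous point and then held fixed.

```python
def return_to_surface(h, dh_dy_at_xk, y0, w, tol=1e-10, max_iter=50):
    """Iterate y <- y - h(y, w) / dh_dy_at_xk until |h(y, w)| <= tol.

    Scalar (m = 1) case of (65): the derivative is computed at the
    previous point x_k and not updated during the correction.
    """
    y = y0
    for _ in range(max_iter):
        residual = h(y, w)
        if abs(residual) <= tol:
            return y
        y -= residual / dh_dy_at_xk
    raise RuntimeError("correction did not converge; try a shorter step")


# Hypothetical constraint: the unit circle h(y, z) = y^2 + z^2 - 1 = 0.
def h(y, z):
    return y * y + z * z - 1.0


y_k, z_k = 0.8, 0.6              # previous point x_k, on the circle
dh_dy = 2.0 * y_k                # derivative of h with respect to y at x_k
dh_dz = 2.0 * z_k                # derivative of h with respect to z at x_k

dz = 0.1                         # tentative move in the independent variable
w = z_k + dz
v = y_k - (dh_dz / dh_dy) * dz   # tangent move: dy = -[h_y]^{-1} h_z dz

y_new = return_to_surface(h, dh_dy, v, w)
print(v, y_new, h(y_new, w))
```

The tolerance plays the role of the feasibility tolerance mentioned above, and the exception marks the point at which a practical code would shorten the step and try again.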

The reduced gradient method suffers from the same basic difficulties as the 
gradient projection method, but as with the latter method, these difficulties can 
all be more or less successfully resolved. Computation is somewhat less complex 
in the case of the reduced gradient method, because rather than compute with
$[\nabla h(x) \nabla h(x)^T]^{-1}$ at each step, the matrix $[\nabla_y h(y, z)]^{-1}$ is used.

Fig. 12.10 Reduced gradient method


12.7 CONVERGENCE RATE OF THE REDUCED 
GRADIENT METHOD 


As argued before, for purposes of analyzing the rate of convergence, it is sufficient 
to consider the problem having only equality constraints 


$$
\begin{aligned}
\text{minimize}\quad & f(x) \\
\text{subject to}\quad & h(x) = 0.
\end{aligned}
\tag{66}
$$

We then regard the problem as being defined over a surface $\Omega$ of dimension n - m.
At this point it is again timely to consider the view of our bug, who lives on 
this constraint surface. Invariably, he continues to regard the problem as extremely 
elementary, and indeed would have little appreciation for the complexity that seems 
to face us. To him the problem is an unconstrained problem in n — m dimensions 
not, as we see it, a constrained problem in n dimensions. The bug will tenaciously 
hold to the method of steepest descent. We can emulate him provided that we know 
how he measures distance on his surface and thus how he calculates gradients and 
what he considers to be straight lines. 

Rather than imagine that the measure of distance on his surface is the one that 
would be inherited from us in n dimensions, as we did when studying the gradient 
projection method, we, in this instance, follow the construction shown in Fig. 12.11. 
In our n-dimensional space, n — m coordinates are selected as independent variables 
in such a way that, given their values, the values of the remaining (dependent) 
variables are determined by the surface. There is already a coordinate system in
the space of independent variables, and it can be used on the surface by projecting
it parallel to the space of the remaining dependent variables. Thus, an arc on the
surface is considered to be straight if its projection onto the space of independent
variables is a segment of a straight line. With this method for inducing a geometry
on the surface, the bug’s notion of steepest descent exactly coincides with an
idealized version of the reduced gradient method.

Fig. 12.11 Induced coordinate system

In the idealized version of the reduced gradient method for solving (66), the
vector x is partitioned as x = (y, z) where $y \in E^m$, $z \in E^{n-m}$. It is assumed that the
m × m matrix $\nabla_y h(y, z)$ is nonsingular throughout a given region of interest. (With
respect to the more general problem, this region is a small neighborhood around the
solution where it is not necessary to change the partition.) The vector y is regarded
as an implicit function of z through the equation

$$ h(y(z), z) = 0. \tag{67} $$

The ordinary method of steepest descent is then applied to the function q(z) =
f(y(z), z). We note that the gradient $r^T$ of this function is given by (63).
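
The claim that (63) is the gradient of the objective viewed as a function of z alone can be checked numerically. The sketch below uses a made-up objective and a made-up constraint with one dependent variable, chosen so that y(z) is available in closed form; it compares formula (63) with a central finite-difference gradient. All function names and numbers are assumptions for illustration.

```python
def f(y, z):
    # Hypothetical objective in one dependent and two independent variables.
    return y * y + 2.0 * z[0] * z[0] + z[0] * z[1] + 3.0 * z[1]


def y_of_z(z):
    # Hypothetical constraint h(y, z) = y - z1^2 - z2 = 0, solved for y.
    return z[0] * z[0] + z[1]


def q(z):
    # Objective viewed as a function of the independent variables only.
    return f(y_of_z(z), z)


def reduced_gradient(z):
    y = y_of_z(z)
    grad_y_f = 2.0 * y
    grad_z_f = [4.0 * z[0] + z[1], z[0] + 3.0]
    grad_y_h = 1.0
    grad_z_h = [-2.0 * z[0], -1.0]
    # (63) with m = 1: r = grad_z f - grad_y f * (grad_y h)^{-1} * grad_z h
    return [gz - grad_y_f / grad_y_h * hz for gz, hz in zip(grad_z_f, grad_z_h)]


def finite_difference_gradient(func, z, step=1e-6):
    grad = []
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += step
        zm[i] -= step
        grad.append((func(zp) - func(zm)) / (2.0 * step))
    return grad


z0 = [0.5, -1.0]
r = reduced_gradient(z0)
fd = finite_difference_gradient(q, z0)
print(r, fd)
```

The two vectors should agree to within the finite-difference error, which is what makes steepest descent on the independent variables equivalent to the idealized reduced gradient method.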

Since the method is really just the ordinary method of steepest descent with 
respect to z, the rate of convergence is determined by the eigenvalues of the Hessian 
of the function q at the solution. We therefore turn to the question of evaluating
this Hessian. 

Denote by Y(z) the first derivatives of the implicit function y(z), that is,
$Y(z) = \nabla_z y(z)$. Explicitly,

$$ Y(z) = -[\nabla_y h(y, z)]^{-1} \nabla_z h(y, z). \tag{68} $$

For any $\lambda \in E^m$ we have

$$ q(z) = f(y(z), z) = f(y(z), z) + \lambda^T h(y(z), z). \tag{69} $$

Thus

$$ \nabla q(z) = [\nabla_y f(y, z) + \lambda^T \nabla_y h(y, z)]\,Y(z) + \nabla_z f(y, z) + \lambda^T \nabla_z h(y, z). \tag{70} $$

Now if at a given point $x^* = (y^*, z^*) = (y(z^*), z^*)$, we let $\lambda$ satisfy

$$ \nabla_y f(y^*, z^*) + \lambda^T \nabla_y h(y^*, z^*) = 0; \tag{71} $$

then introducing the Lagrangian $l(y, z, \lambda) = f(y, z) + \lambda^T h(y, z)$, we obtain by
differentiating (70)

$$
\begin{aligned}
\nabla^2 q(z^*) ={} & Y(z^*)^T \nabla^2_{yy} l(y^*, z^*)\,Y(z^*) + \nabla^2_{zy} l(y^*, z^*)\,Y(z^*) \\
& + Y(z^*)^T \nabla^2_{yz} l(y^*, z^*) + \nabla^2_{zz} l(y^*, z^*).
\end{aligned}
\tag{72}
$$


Or defining the n × (n - m) matrix

$$ C = \begin{bmatrix} Y(z^*) \\ I \end{bmatrix}, \tag{73} $$

where I is the (n - m) × (n - m) identity, we have

$$ Q = \nabla^2 q(z^*) = C^T L(x^*) C. \tag{74} $$

The matrix $L(x^*)$ is the n × n Hessian of the Lagrangian at $x^*$, and $\nabla^2 q(z^*)$ is an
(n - m) × (n - m) matrix that is a restriction of $L(x^*)$ to the tangent subspace M,
but it is not the usual restriction. We summarize our conclusion with the following
theorem. 


Theorem. Let $x^*$ be a local solution of problem (66). Suppose that the
idealized reduced gradient method produces a sequence $\{x_k\}$ converging to $x^*$
and that the partition x = (y, z) is used throughout the tail of the sequence.
Let L be the Hessian of the Lagrangian at $x^*$ and define the matrix C by (73)
and (68). Then the sequence of objective values $\{f(x_k)\}$ converges to $f(x^*)$
linearly with a ratio no greater than $[(B - b)/(B + b)]^2$, where b and B are,
respectively, the smallest and largest eigenvalues of the matrix $Q = C^T L C$.
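
To make the bound in the theorem concrete, the following Python sketch forms Q from C and L for a case with n = 3 and m = 1 and evaluates the ratio from the two eigenvalues of the resulting 2 × 2 matrix. The entries of Y and L are invented for illustration, and the helper functions are written out only to keep the sketch free of external libraries.

```python
import math


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def sym2x2_eigenvalues(Q):
    """Eigenvalues (smallest, largest) of a symmetric 2 x 2 matrix."""
    mean = 0.5 * (Q[0][0] + Q[1][1])
    radius = math.hypot(0.5 * (Q[0][0] - Q[1][1]), Q[0][1])
    return mean - radius, mean + radius


# Hypothetical data with n = 3, m = 1: Y is the 1 x 2 matrix of (68) and
# L is the 3 x 3 Hessian of the Lagrangian, ordered as (y, z1, z2).
Y = [[-0.5, 2.0]]
L = [[2.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 4.0]]

C = Y + [[1.0, 0.0], [0.0, 1.0]]          # C = [Y; I], an n x (n - m) matrix
Q = matmul(transpose(C), matmul(L, C))    # Q = C^T L C as in (74)
b, B = sym2x2_eigenvalues(Q)
ratio_bound = ((B - b) / (B + b)) ** 2
print(Q, b, B, ratio_bound)
```

Changing Y while holding L fixed changes the eigenvalues of Q, which is the dependence on the choice of independent variables discussed next.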


To compare the matrix $C^T L C$ with the usual restriction of L to M that determines
the convergence rate of most methods, we note that the n × (n - m) matrix
C maps $\Delta z \in E^{n-m}$ into $(\Delta y, \Delta z) \in E^n$ lying in the tangent subspace M; that is,
$\nabla_y h\,\Delta y + \nabla_z h\,\Delta z = 0$. Thus the columns of C form a basis for the subspace M.
Next note that the columns of the matrix

$$ E = C (C^T C)^{-1/2} \tag{75} $$

form an orthonormal basis for M, since each column of E is just a linear combination
of columns of C and by direct calculation we see that $E^T E = I$. Thus by the
procedure described in Section 11.6 we see that a representation for the usual
restriction of L to M is

$$ L_M = (C^T C)^{-1/2} C^T L C (C^T C)^{-1/2}. \tag{76} $$

Comparing (76) with (74) we deduce that

$$ Q = (C^T C)^{1/2} L_M (C^T C)^{1/2}. \tag{77} $$


This means that the Hessian matrix for the reduced gradient method is the restriction 
of L to M but pre- and post-multiplied by a positive definite symmetric matrix. 

The eigenvalues of Q depend on the exact nature of C as well as $L_M$. Thus, the
rate of convergence of the reduced gradient method is not coordinate independent 
but depends strongly on just which variables are declared as independent at the 
final stage of the process. The convergence rate can be either faster or slower than 
that of the gradient projection method. In general, however, if C is well-behaved 
(that is, well-conditioned), the ratio of eigenvalues for the reduced gradient method 
can be expected to be the same order of magnitude as that of the gradient projection 
method. If, however, C should be ill-conditioned, as would arise in the case where 
the implicit equation h(y, z) = 0 is itself ill-conditioned, then it can be shown that 
the eigenvalue ratio for the reduced gradient method will most likely be considerably 
worsened. This suggests that care should be taken to select a set of basic variables 
y that leads to a well-behaved C matrix. 


Example. (The hanging chain problem). Consider again the hanging chain 
problem discussed in Section 11.4. This problem can be used to illustrate a wide 
assortment of theoretical principles and practical techniques. Indeed, a study of this 
example clearly reveals the predictive power that can be derived from an interplay 
of theory and physical intuition. 


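The problem is

$$
\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{n} (n - i + 0.5)\, y_i \\
\text{subject to}\quad & \sum_{i=1}^{n} y_i = 0 \\
& \sum_{i=1}^{n} \sqrt{1 - y_i^2} = 16,
\end{aligned}
$$

where in the original formulation n = 20.
This problem has been solved numerically by the reduced gradient method.*
An initial feasible solution was the triangular shape shown in Fig. 12.12(a) with

$$ y_i = \begin{cases} -0.6, & 1 \le i \le 10 \\ \phantom{-}0.6, & 11 \le i \le 20. \end{cases} $$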


*The exact solution is obviously symmetric about the center of the chain, and hence
the problem could be reduced to having 10 links and only one constraint. However, this
symmetry disappears if the first constraint value is specified as nonzero. Therefore for
generality we solve the full chain problem.

Fig. 12.12 The chain example: (a) original configuration of chain, (b) final configuration, (c) long chain

Table 12.1 Results of original chain problem

Iteration   Value        Solution (1/2 of chain)
0           -60.00000    y1  = -.8148260
10          -66.47610    y2  = -.7826505
20          -66.52180    y3  = -.7429208
30          -66.53595    y4  = -.6930959
40          -66.54154    y5  = -.6310976
50          -66.54537    y6  = -.5541078
60          -66.54628    y7  = -.4597160
69          -66.54659    y8  = -.3468334
70          -66.54659    y9  = -.2169879
                         y10 = -.07492541

Lagrange multipliers: -9.993817, -6.763148
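
The starting point of the chain example can be checked directly against the problem statement. The short Python sketch below evaluates the objective and both constraint sums at the triangular initial configuration; the variable names are chosen here for illustration.

```python
import math

n = 20
# Triangular starting shape: first ten links slope down, last ten slope up.
y0 = [-0.6 if i <= 10 else 0.6 for i in range(1, n + 1)]

objective = sum((n - i + 0.5) * y0[i - 1] for i in range(1, n + 1))
vertical_sum = sum(y0)                                   # first constraint, should be 0
span = sum(math.sqrt(1.0 - yi * yi) for yi in y0)        # second constraint, should be 16

print(objective, vertical_sum, span)
```

Up to rounding, the objective agrees with the iteration-0 value in Table 12.1 and both constraints are satisfied, so the starting point is feasible.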


The results obtained from a reduced gradient package are shown in Table 12.1. 
Note that convergence is obtained in approximately 70 iterations. 

The Lagrange multipliers of the constraints are a by-product of the solution. 
These can be used to estimate the change in solution value if the constraint values 
are changed slightly. For example, suppose we wish to estimate, without resolving 
the problem, the change in potential energy (the objective function) that would result 
if the separation between the two supports were increased by, say, one inch. The
change can be estimated by the formula $\Delta f \approx -\lambda_2/12 = 0.0833 \times 6.76 = 0.563$.
(When solved again numerically the change is found to be 0.568.)

Let us now pose some more challenging questions. Consider two variations of the 
original problem. In the first variation the chain is replaced by one having twice as many 
links, but each link is now half the size of the original links. The overall chain length is 
therefore the same as before. In the second variation the original chain is replaced by 
one having twice as many links, but each link is the same size as the original links. The 
chain length doubles in this case. If these problems are solved by the same method as 
the original problem, approximately how many iterations will be required—about the 
same number, many more, or substantially less? 

These questions can be easily answered by using the theory of convergence 
rates developed in this chapter. The Hessian of the Lagrangian is 


$$ L = F + \lambda_1 H_1 + \lambda_2 H_2. $$

However, since the objective function and the first constraint are both linear, the
only nonzero term in the above equation is $\lambda_2 H_2$. Furthermore, since convergence
rates depend only on eigenvalue ratios, the $\lambda_2$ can be ignored. Thus the eigenvalues
of $H_2$ determine the canonical convergence rate.

It is easily seen that $H_2$ is diagonal with ith diagonal term

$$ (H_2)_{ii} = -(1 - y_i^2)^{-3/2}, $$

and these values are the eigenvalues of $H_2$. The canonical convergence rate is
defined by the eigenvalues of $H_2$ in the (n - 2)-dimensional tangent subspace M.
We cannot exactly determine these eigenvalues without a lot of work, but we can
assume that they are close to the eigenvalues of $H_2$. (Indeed, a version of the
Interlocking Eigenvalues Lemma states that the n - 2 eigenvalues are interlocked
with the eigenvalues of $H_2$.) Then the convergence rate of the gradient projection
method will be governed by these eigenvalues. The reduced gradient method will
most likely be somewhat slower.

The eigenvalue of smallest absolute value corresponds to the center links, where
$y_i \approx 0$. Conversely, the eigenvalue of largest absolute value corresponds to the first
or last link, where $y_i$ is largest in absolute value. Thus the relevant eigenvalue ratio
is approximately

$$ r = \frac{1}{(1 - y_1^2)^{3/2}} = \frac{1}{(\sin\theta)^3}, $$

where $\theta$ is the angle shown in Fig. 12.12(b).
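
The approximation can be compared with the ratio computed from the tabulated solution. The sketch below evaluates the diagonal entries of the second constraint's Hessian at the half-chain values listed in Table 12.1 and takes the ratio of the largest to the smallest magnitude; this is the ratio for the unrestricted diagonal matrix, not for its restriction to the tangent subspace, and the variable names are illustrative.

```python
# Half-chain solution values y_1, ..., y_10 as listed in Table 12.1.
y_half = [-0.8148260, -0.7826505, -0.7429208, -0.6930959, -0.6310976,
          -0.5541078, -0.4597160, -0.3468334, -0.2169879, -0.07492541]

# Diagonal entries -(1 - y_i^2)^(-3/2); these are also the eigenvalues.
diagonal = [-((1.0 - y * y) ** -1.5) for y in y_half]

magnitudes = [abs(d) for d in diagonal]
ratio = max(magnitudes) / min(magnitudes)
print(ratio)
```

The result is close to 5, in line with the estimate obtained from the end link alone.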

For very little effort we have obtained a powerful understanding of the chain
problem and its convergence properties. We can use this to answer the questions
posed earlier. For the first variation, with twice as many links but each of half size,
the angle $\theta$ will be about the same (perhaps a little smaller because of increased
flexibility of the chain). Thus the number of iterations should be slightly larger
because of the decrease in $\theta$ and somewhat larger again because there are more
variables (which tends to increase the condition number of $C^T C$). Note in Table 12.2
that about 122 iterations were required, which is consistent with this estimate.

For the second variation the chain will hang more vertically; hence $y_1$ will
be larger, and therefore convergence will be fundamentally slower. To be more
specific it is necessary to substitute a few numbers in our simple formula. For the
original case we have $y_1 \approx -.81$. This yields

$$ r = (1 - .81^2)^{-3/2} = 4.9 $$

and a convergence factor of

$$ R = \left(\frac{r - 1}{r + 1}\right)^2 \approx .44. $$

This is a modest value and quite consistent with the observed result of 70 iterations
for a reduced gradient method. For the long chain we can estimate that $y_1 \approx .98$.
This yields

$$ r = (1 - .98^2)^{-3/2} \approx 127 $$

$$ R = \left(\frac{r - 1}{r + 1}\right)^2 \approx .969. $$

This last number represents extremely slow convergence. Indeed, since $(.969)^{25} \approx .45$,
which is close to the factor .44 found for the original chain, we expect that it may
easily take twenty-five times as many iterations for
the long chain problem to converge as the original problem (although quantitative
estimates of this type are rough at best). This again is verified by the results shown
in Table 12.2, where it is indicated that over 2500 iterations were required by a
version of the reduced gradient method.

Table 12.2 Results of modified chain problems

Short links                Long chain
Iteration   Value          Iteration   Value
0           -60.00000      0           -366.6061
10          -66.45499      10          -375.6423
20          -66.56377      20          -375.9123
40          -66.58443      50          -376.5128
60          -66.59191      100         -377.1625
80          -66.59514      200         -377.8983
100         -66.59656      500         -378.7989
120         -66.59825      1000        -379.3012
121         -66.59827      1500        -379.4994
122         -66.59827      2000        -379.5965
                           2500        -379.6489

y1 = .4109519              y1 = .9886223
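
The arithmetic behind these estimates is easy to reproduce. The Python sketch below recomputes the eigenvalue ratio and the steepest descent factor for both chains from the rounded end-link values quoted in the text, and then estimates how many long-chain iterations give the same reduction as one original-chain iteration. The function names are chosen here for illustration.

```python
import math


def eigenvalue_ratio(y1):
    """r = (1 - y1^2)^(-3/2), the ratio of extreme eigenvalue magnitudes."""
    return (1.0 - y1 * y1) ** -1.5


def steepest_descent_factor(r):
    """R = ((r - 1) / (r + 1))^2, the linear convergence ratio bound."""
    return ((r - 1.0) / (r + 1.0)) ** 2


r_orig = eigenvalue_ratio(0.81)
r_long = eigenvalue_ratio(0.98)
R_orig = steepest_descent_factor(r_orig)
R_long = steepest_descent_factor(r_long)

# Long-chain iterations needed to match the reduction of one original iteration.
iteration_multiple = math.log(R_orig) / math.log(R_long)
print(r_orig, R_orig, r_long, R_long, iteration_multiple)
```

The multiple comes out near twenty-five, which is the source of the rough iteration estimate for the long chain.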


12.8 VARIATIONS 


It is possible to modify either the gradient projection method or the reduced gradient 
method so as to move in directions that are determined through additional consid- 
erations. For example, analogs of the conjugate gradient method, PARTAN, or any 
of the quasi-Newton methods can be applied to constrained problems by handling 
constraints through projection or reduction. The corresponding asymptotic rates 
of convergence for such methods are easily determined by applying the results 
for unconstrained problems on the (n — m)-dimensional surface of constraints, as 
illustrated in this chapter. 

Although such generalizations can sometimes lead to substantial improvement 
in convergence rates, one must recognize that the detailed logic for a complicated 
generalization can become lengthy. If the method relies on the use of an approximate 
inverse Hessian restricted to the constraint surface, there must be an effective 
procedure for updating the approximation when the iterative process progresses 
from one set of active constraints to another. One would also like to insure that the 
poor eigenvalue structure sometimes associated with quasi-Newton methods does 
not dominate the short-term convergence characteristics of the extended method 
when the active constraint set changes. In other words, one would like to be able to 
achieve simultaneously both superlinear convergence and a guarantee of fast single 
step progress. There has been some work in this general area and it appears to be
one of potential promise.


*Convex Simplex Method


A popular modification of the reduced gradient method, termed the convex simplex 
method, most closely parallels the highly effective simplex method for solving 
linear programs. The major difference between this method and the reduced gradient 
method is that instead of moving all (or several) of the independent variables in 
the direction of the negative reduced gradient, only one independent variable is 
changed at a time. The selection of the one independent variable to change is made 
much as in the ordinary simplex method. 

At a given feasible point, let x = (y, z) be the partition of x into dependent
and independent parts, and assume for simplicity that the bounds on x are $x \ge 0$.
Given the reduced gradient $r^T$ at the current point, the component $z_i$ to be changed
is found from:


1. Let $r_{i_1} = \min_i \{r_i\}$.

2. Let $r_{i_2} z_{i_2} = \max_i \{r_i z_i\}$.
If $r_{i_1} = r_{i_2} z_{i_2} = 0$, terminate. Otherwise:
If $r_{i_1} \le -|r_{i_2} z_{i_2}|$, increase $z_{i_1}$;
If $r_{i_1} \ge -|r_{i_2} z_{i_2}|$, decrease $z_{i_2}$.


The rule in Step 2 amounts to selecting the variable that yields the best potential 
decrease in the cost function. The rule accounts for the non-negativity constraint 
on the independent variables by weighting the cost coefficients of those variables 
that are candidates to be decreased by their distance from zero. This feature ensures 
global convergence of the method. 
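
The selection rule can be sketched as a small Python function. Two details are assumptions of this sketch rather than part of the rule as stated: the termination test is written with a tolerance as the condition that no component of r is negative and no product of r and z is positive, and a guard is added so that a decrease is proposed only when some product is positive. The function name and the return convention are likewise illustrative.

```python
def select_variable(r, z, tol=1e-12):
    """Pick the independent variable to change in one convex simplex step.

    r: reduced gradient components; z: current independent variables (z >= 0).
    Returns (index, "increase"), (index, "decrease"), or None at termination.
    """
    indices = range(len(r))
    i1 = min(indices, key=lambda i: r[i])          # most negative reduced gradient
    i2 = max(indices, key=lambda i: r[i] * z[i])   # largest weighted component
    r1 = r[i1]
    w2 = r[i2] * z[i2]
    if r1 >= -tol and w2 <= tol:
        return None                                # assumed optimality test
    if w2 <= tol or r1 <= -abs(w2):
        return i1, "increase"
    return i2, "decrease"


print(select_variable([-3.0, 1.0], [0.0, 2.0]))
print(select_variable([-1.0, 2.0], [0.0, 3.0]))
print(select_variable([0.0, 1.0], [5.0, 0.0]))
```

In the first call the negative component dominates, so the first variable is increased; in the second the weighted positive component dominates, so the second variable is decreased; the third call is at a point where neither move can reduce the cost.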

The remaining details of the method are identical to those of the reduced 
gradient method. Once a particular component of z is selected for change, according 
to the above criterion, the corresponding y vector is computed as a function of 
the change in that component so as to continuously satisfy the constraints. The 
component of z is continuously changed until either a local minimum with respect 
to that component is attained or the boundary of one nonnegativity constraint is 
reached. 

Just as in the discussion of the reduced gradient method, it is convenient, 
for purposes of convergence analysis, to view the problem as unconstrained with 
respect to the independent variables. The convex simplex method is then seen to 
be a coordinate descent procedure in the space of these n—m variables. Indeed, 
since the component selected is based on the magnitude of the components of 
the reduced gradient, the method is merely an adaptation of the Gauss-Southwell 
scheme discussed in Section 8.9 to the constrained situation. Hence, although it is 
difficult to pin down precisely, we expect that it would take approximately n —m 
steps of this coordinate descent method to make the progress of a single reduced 
gradient step. To be competitive with the reduced gradient method, therefore, the
difficulties associated with a single step (line searching and constraint evaluation)
must be approximately n - m times simpler when only a single component is
varied than when all n - m are varied simultaneously. This is indeed the case for
linear programs and for some quadratic programs but not for nonlinear problems
that require the full line search machinery. Hence, in general, the convex simplex
method may not be a bargain.


12.9 SUMMARY 


The concept of feasible direction methods is a straightforward and logical extension 
of the methods used for unconstrained problems but leads to some subtle difficulties. 
These methods are susceptible to jamming (lack of global convergence) because many 
simple direction finding mappings and the usual line search mapping are not closed. 

Problems with inequality constraints can be approached with an active set 
strategy. In this approach certain constraints are treated as active and the others 
are treated as inactive. By systematically adding and dropping constraints from 
the working set, the correct set of active constraints is determined during the 
search process. In general, however, an active set method may require that several 
constrained problems be solved exactly. 

The most practical primal methods are the gradient projection methods and the 
reduced gradient method. Both of these basic methods can be regarded as the method 
of steepest descent applied on the surface defined by the active constraints. The rate 
of convergence for the two methods can be expected to be approximately equal and 
is determined by the eigenvalues of the Hessian of the Lagrangian restricted to the 
subspace tangent to the active constraints. Of the two methods, the reduced gradient 
method seems to be best. It can be easily modified to ensure against jamming and 
it requires fewer computations per iterative step and therefore, for most problems, 
will probably converge in less time than the gradient projection method. 


12.10 EXERCISES 


1. Show that the problem of finding $d = (d_1, d_2, \ldots, d_n)$ to

$$
\begin{aligned}
\text{minimize}\quad & c^T d \\
\text{subject to}\quad & Ad \le 0 \\
& \sum_{i=1}^{n} |d_i| = 1
\end{aligned}
$$

can be converted to a linear program.


2. Sometimes a different normalizing term is used in (4). Show that the problem of finding
$d = (d_1, d_2, \ldots, d_n)$ to

$$
\begin{aligned}
\text{minimize}\quad & c^T d \\
\text{subject to}\quad & Ad \le 0 \\
& \max_i |d_i| = 1
\end{aligned}
$$

can be converted to a linear program.


3. Perhaps the most natural normalizing term to use in (4) is one based on the Euclidean
norm. This leads to the problem of finding $d = (d_1, d_2, \ldots, d_n)$ to

$$
\begin{aligned}
\text{minimize}\quad & c^T d \\
\text{subject to}\quad & Ad \le 0 \\
& \sum_{i=1}^{n} d_i^2 = 1.
\end{aligned}
$$

Find the Karush-Kuhn-Tucker necessary conditions for this problem and show how
they can be solved by a modification of the simplex procedure.


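4. Let $\Omega \subset E^n$ be a given feasible region. A set $\Gamma \subset E^{2n}$ consisting of pairs (x, d), with
$x \in \Omega$ and d a feasible direction at x, is said to be a set of uniformly feasible direction
vectors if there is a $\delta > 0$ such that $(x, d) \in \Gamma$ implies that $x + \alpha d$ is feasible for all
$\alpha$, $0 \le \alpha \le \delta$. The number $\delta$ is referred to as the feasibility constant of the set $\Gamma$.

Let $\Gamma \subset E^{2n}$ be a set of uniformly feasible direction vectors for $\Omega$, with feasibility
constant $\delta$. Define the mapping

$$ M_\delta(x, d) = \{y : f(y) \le f(x + \tau d) \text{ for all } \tau,\ 0 \le \tau \le \delta;\ y = x + \alpha d \text{ for some } \alpha,\ 0 \le \alpha \le \infty,\ y \in \Omega\}. $$

Show that if $d \ne 0$, the map $M_\delta$ is closed at (x, d).


5. Let $\Gamma \subset E^{2n}$ be a set of uniformly feasible direction vectors for $\Omega$ with feasibility
constant $\delta$. For $\varepsilon > 0$ define the map ${}^{\varepsilon}M_\delta$ on $\Gamma$ by

$$ {}^{\varepsilon}M_\delta(x, d) = \{y : f(y) \le f(x + \tau d) + \varepsilon \text{ for all } \tau,\ 0 \le \tau \le \delta;\ y = x + \alpha d \text{ for some } \alpha,\ 0 \le \alpha \le \infty,\ y \in \Omega\}. $$

The map ${}^{\varepsilon}M_\delta$ corresponds to an “inaccurate” constrained line search. Show that this
map is closed if $d \ne 0$.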


6. For the problem

$$
\begin{aligned}
\text{minimize}\quad & f(x) \\
\text{subject to}\quad & a_1^T x \le b_1 \\
& a_2^T x \le b_2 \\
& \quad\vdots \\
& a_m^T x \le b_m,
\end{aligned}
$$

consider selecting $d = (d_1, d_2, \ldots, d_n)$ at a feasible point x by solving the problem

$$
\begin{aligned}
\text{minimize}\quad & \nabla f(x)\, d \\
\text{subject to}\quad & a_i^T d \le (b_i - a_i^T x) M, \quad i = 1, 2, \ldots, m \\
& \sum_{i=1}^{n} |d_i| = 1,
\end{aligned}
$$

where M is some given positive constant. For large M the ith inequality of this
subsidiary problem will be active only if the corresponding inequality in the original
problem is nearly active at x (indeed, note that $M \to \infty$ corresponds to Zoutendijk’s
method). Show that this direction finding mapping is closed and generates uniformly
feasible directions with feasibility constant 1/M.


7. Generalize the method of Exercise 6 so that it is applicable to nonlinear inequalities. 
8. An alternate, but equivalent, definition of the projected gradient p is that it is the vector 
solving 
minimize |g—p|? 
subject to A,p=0. 
Using the Karush-Kuhn—Tucker necessary conditions, solve this problem and thereby 
derive the formula for the projected gradient. 


9. Show that finding the d that solves

$$
\begin{aligned}
\text{minimize}\quad & g^T d \\
\text{subject to}\quad & A_q d = 0, \quad |d|^2 = 1
\end{aligned}
$$

gives a vector d that has the same direction as the negative projected gradient.


10. Let P be a projection matrix. Show that $P^2 = P$, $P^T = P$.


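11. Suppose $A_q = \begin{bmatrix} a^T \\ \bar{A}_q \end{bmatrix}$ so that $A_q$ is the matrix $\bar{A}_q$ with the row $a^T$ adjoined. Show that
$(A_q A_q^T)^{-1}$ can be found from $(\bar{A}_q \bar{A}_q^T)^{-1}$ from the formula

$$
(A_q A_q^T)^{-1} =
\begin{bmatrix}
\xi & -\xi\, a^T \bar{A}_q^T (\bar{A}_q \bar{A}_q^T)^{-1} \\
-\xi\, (\bar{A}_q \bar{A}_q^T)^{-1} \bar{A}_q a & (\bar{A}_q \bar{A}_q^T)^{-1} \left[ I + \xi\, \bar{A}_q a a^T \bar{A}_q^T (\bar{A}_q \bar{A}_q^T)^{-1} \right]
\end{bmatrix},
$$

where

$$ \xi = \frac{1}{a^T a - a^T \bar{A}_q^T (\bar{A}_q \bar{A}_q^T)^{-1} \bar{A}_q a}. $$

Develop a similar formula for $(\bar{A}_q \bar{A}_q^T)^{-1}$ in terms of $(A_q A_q^T)^{-1}$.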


12. Show that the gradient projection method will solve a linear program in a finite number 
of steps. 


13. Suppose that the projected negative gradient d is calculated satisfying

$$ -g = d + A_q^T \lambda $$

and that some component $\lambda_i$ of $\lambda$, corresponding to an inequality, is negative. Show
that if the ith inequality is dropped, the projection $d_i$ of the negative gradient onto the
remaining constraints is a feasible direction of descent.


14. Using the result of Exercise 13, it is possible to avoid the discontinuity at d = 0 in the
direction finding mapping of the simple gradient projection method. At a given point let
$\gamma = -\min\{0, \lambda_i\}$, with the minimum taken with respect to the indices i corresponding to
the active inequalities. The direction to be taken at this point is $d = -Pg$ if $|Pg| \ge \gamma$,
or $d_i$, defined by dropping the inequality i for which $\lambda_i = -\gamma$, if $|Pg| \le \gamma$. (In case of
equality either direction is selected.) Show that this direction finding map is closed over
a region where the set of active inequalities does not change.


15. Consider the problem of maximizing entropy discussed in Example 3, Section 14.4.
Suppose this problem were solved numerically with two constraints by the gradient
projection method. Derive an estimate for the rate of convergence in terms of the
optimal $p_i$’s.


16. Find the geodesics of

a) a two-dimensional plane
b) a sphere.


17. Suppose that the problem

$$
\begin{aligned}
\text{minimize}\quad & f(x) \\
\text{subject to}\quad & h(x) = 0
\end{aligned}
$$

is such that every point is a regular point. And suppose that the sequence of points
$\{x_k\}$ generated by geodesic descent is bounded. Prove that every limit point of the
sequence satisfies the first-order necessary conditions for a constrained minimum.


18. Show that, for linear constraints, if at some point in the reduced gradient method $\Delta z$ is
zero, that point satisfies the Karush-Kuhn-Tucker first-order necessary conditions for a
constrained minimum.


19. Consider the problem

$$
\begin{aligned}
\text{minimize}\quad & f(x) \\
\text{subject to}\quad & Ax = b \\
& x \ge 0,
\end{aligned}
$$

where A is m × n. Assume $f \in C^1$, that the feasible set is bounded, and that the
nondegeneracy assumption holds. Suppose a “modified” reduced gradient algorithm is
defined following the procedure in Section 12.6 but with two modifications: (i) the basic
variables are, at the beginning of an iteration, always taken as the m largest variables
(ties are broken arbitrarily); (ii) the formula for $\Delta z$ is replaced by

$$ \Delta z_i = \begin{cases} -r_i & \text{if } r_i \le 0 \\ -x_i r_i & \text{if } r_i > 0. \end{cases} $$

Establish the global convergence of this algorithm.


20. Find the exact solution to the example presented in Section 12.4.


21. Find the direction of movement that would be taken by the gradient projection method
if in the example of Section 12.4 the constraint $x_4 = 0$ were relaxed. Show that if the
term $-3x_4$ in the objective function were replaced by $-x_2$, then both the gradient
projection method and the reduced gradient method would move in identical directions.


22. Show that in terms of convergence characteristics, the reduced gradient method behaves
like the gradient projection method applied to a scaled version of the problem.


23. Let r be the condition number of $L_M$ and s the condition number of $C^T C$. Show that the
rate of convergence of the reduced gradient method is no worse than $[(sr - 1)/(sr + 1)]^2$.


24. Formulate the symmetric version of the hanging chain problem using a single constraint.
Find an explicit expression for the condition number of the corresponding $C^T C$ matrix
(assuming $y_1$ is basic). Use Exercise 23 to obtain an estimate of the convergence
rate of the reduced gradient method applied to this problem, and compare it with the
rate obtained in Table 12.1, Section 12.7. Repeat for the two-constraint formulation
(assuming $y_1$ and $y_n$ are basic).


25. Referring to Exercise 19 establish a global convergence result for the convex simplex
method.


Chapter 13 PENALTY AND BARRIER METHODS


Penalty and barrier methods are procedures for approximating constrained 
optimization problems by unconstrained problems. The approximation is accom- 
plished in the case of penalty methods by adding to the objective function a term 
that prescribes a high cost for violation of the constraints, and in the case of barrier 
methods by adding a term that favors points interior to the feasible region over 
those near the boundary. Associated with these methods is a parameter c or $\mu$ that
determines the severity of the penalty or barrier and consequently the degree to
which the unconstrained problem approximates the original constrained problem.
For a problem with n variables and m constraints, penalty and barrier methods work
directly in the n-dimensional space of variables, as compared to primal methods 
that work in (n — m)-dimensional space. 

There are two fundamental issues associated with the methods of this chapter. 
The first has to do with how well the unconstrained problem approximates the 
constrained one. This is essential in examining whether, as the parameter c is 
increased toward infinity, the solution of the unconstrained problem converges 
to a solution of the constrained problem. The other issue, most important from 
a practical viewpoint, is the question of how to solve a given unconstrained 
problem when its objective function contains a penalty or barrier term. It turns out 
that as c is increased to yield a good approximating problem, the corresponding 
structure of the resulting unconstrained problem becomes increasingly unfavorable 
thereby slowing the convergence rate of many algorithms that might be applied. 
(Exact penalty functions also have a very unfavorable structure.) It is necessary, 
then, to devise acceleration procedures that circumvent this slow convergence 
phenomenon. 

Penalty and barrier methods are of great interest to both the practitioner and the 
theorist. To the practitioner they offer a simple straightforward method for handling 
constrained problems that can be implemented without sophisticated computer 
programming and that possess much the same degree of generality as primal 
methods. The theorist, striving to make this approach practical by overcoming its
inherently slow convergence, finds it appropriate to bring into play nearly all aspects
of optimization theory, including Lagrange multipliers, necessary conditions, and
many of the algorithms discussed earlier in this book. The canonical rate of convergence
associated with the original constrained problem again asserts its fundamental
role by essentially determining the natural accelerated rate of convergence for
unconstrained penalty or barrier problems.


13.1 PENALTY METHODS 


Consider the problem 


$$
\begin{aligned}
&\text{minimize} && f(\mathbf{x}) \\
&\text{subject to} && \mathbf{x} \in S,
\end{aligned} \tag{1}
$$

where $f$ is a continuous function on $E^n$ and $S$ is a constraint set in $E^n$. In most
applications $S$ is defined implicitly by a number of functional constraints, but in this
section the more general description in (1) can be handled. The idea of a penalty
function method is to replace problem (1) by an unconstrained problem of the form

$$
\text{minimize} \quad f(\mathbf{x}) + cP(\mathbf{x}), \tag{2}
$$

where $c$ is a positive constant and $P$ is a function on $E^n$ satisfying: (i) $P$ is
continuous, (ii) $P(\mathbf{x}) \geq 0$ for all $\mathbf{x} \in E^n$, and (iii) $P(\mathbf{x}) = 0$ if and only if $\mathbf{x} \in S$.


Example 1. Suppose S is defined by a number of inequality constraints: 


$$
S = \{\mathbf{x} : g_i(\mathbf{x}) \leq 0,\ i = 1, 2, \ldots, p\}.
$$

A very useful penalty function in this case is

$$
P(\mathbf{x}) = \frac{1}{2} \sum_{i=1}^{p} \left(\max[0, g_i(\mathbf{x})]\right)^2.
$$

The function $cP(x)$ is illustrated in Fig. 13.1 for the one-dimensional case with
$g_1(x) = x - b$, $g_2(x) = a - x$.

For large $c$ it is clear that the minimum point of problem (2) will be in a
region where $P$ is small. Thus, for increasing $c$ it is expected that the corresponding
solution points will approach the feasible region $S$ and, subject to being close, will
minimize $f$. Ideally then, as $c \to \infty$ the solution point of the penalty problem will
converge to a solution of the constrained problem.

Fig. 13.1 Plot of $cP(x)$


The Method

The procedure for solving problem (1) by the penalty function method is this:
Let $\{c_k\}$, $k = 1, 2, \ldots$, be a sequence tending to infinity such that for each
$k$, $c_k \geq 0$, $c_{k+1} > c_k$. Define the function

$$
q(c, \mathbf{x}) = f(\mathbf{x}) + cP(\mathbf{x}). \tag{3}
$$

For each $k$ solve the problem

$$
\text{minimize} \quad q(c_k, \mathbf{x}), \tag{4}
$$

obtaining a solution point $\mathbf{x}_k$.

We assume here that, for each $k$, problem (4) has a solution. This will be true,
for example, if $q(c, \mathbf{x})$ increases unboundedly as $|\mathbf{x}| \to \infty$. (See also Exercise 2,
which shows that it is not necessary to obtain the minimum precisely.)
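The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-variable problem (minimize $x^2$ subject to $g(x) = 1 - x \leq 0$), the search interval, and the helper names `golden_section` and `penalty_method` are hypothetical choices, not taken from the text. A golden-section search stands in for whatever unconstrained method is actually used.

```python
import math

def f(x):
    # hypothetical objective
    return x * x

def P(x):
    # quadratic penalty for the single constraint g(x) = 1 - x <= 0
    return 0.5 * max(0.0, 1.0 - x) ** 2

def q(c, x):
    # penalized objective, as in definition (3)
    return f(x) + c * P(x)

def golden_section(func, lo, hi, tol=1e-10):
    """Minimize a unimodal one-variable function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        x1 = b - inv_phi * (b - a)
        x2 = a + inv_phi * (b - a)
        if func(x1) < func(x2):
            b = x2          # minimum lies in [a, x2]
        else:
            a = x1          # minimum lies in [x1, b]
    return 0.5 * (a + b)

def penalty_method(cs):
    """Solve problem (4) for each c in the increasing sequence cs."""
    rows = []
    for c in cs:
        x_k = golden_section(lambda x: q(c, x), -5.0, 5.0)
        rows.append((c, x_k, f(x_k), P(x_k), q(c, x_k)))
    return rows

if __name__ == "__main__":
    for c, x_k, f_k, P_k, q_k in penalty_method([1.0, 10.0, 100.0, 1000.0]):
        print(f"c={c:7.1f}  x_k={x_k:.6f}  f={f_k:.6f}  P={P_k:.2e}  q={q_k:.6f}")
```

For this toy problem the minimizer of $q(c, \cdot)$ is $x = c/(2 + c)$, so the iterates approach the constrained solution $x^* = 1$ from the infeasible side as $c$ grows.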


Convergence 


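The following lemma gives a set of inequalities that follow directly from the
definition of $\mathbf{x}_k$ and the inequality $c_{k+1} > c_k$.


Lemma 1.

\begin{align}
q(c_k, \mathbf{x}_k) &\leq q(c_{k+1}, \mathbf{x}_{k+1}) \tag{5} \\
P(\mathbf{x}_k) &\geq P(\mathbf{x}_{k+1}) \tag{6} \\
f(\mathbf{x}_k) &\leq f(\mathbf{x}_{k+1}). \tag{7}
\end{align}

Proof.

\begin{align*}
q(c_{k+1}, \mathbf{x}_{k+1}) &= f(\mathbf{x}_{k+1}) + c_{k+1} P(\mathbf{x}_{k+1}) \geq f(\mathbf{x}_{k+1}) + c_k P(\mathbf{x}_{k+1}) \\
&\geq f(\mathbf{x}_k) + c_k P(\mathbf{x}_k) = q(c_k, \mathbf{x}_k),
\end{align*}

which proves (5).
We also have

\begin{align}
f(\mathbf{x}_k) + c_k P(\mathbf{x}_k) &\leq f(\mathbf{x}_{k+1}) + c_k P(\mathbf{x}_{k+1}) \tag{8} \\
f(\mathbf{x}_{k+1}) + c_{k+1} P(\mathbf{x}_{k+1}) &\leq f(\mathbf{x}_k) + c_{k+1} P(\mathbf{x}_k). \tag{9}
\end{align}

Adding (8) and (9) yields

$$
(c_{k+1} - c_k) P(\mathbf{x}_{k+1}) \leq (c_{k+1} - c_k) P(\mathbf{x}_k),
$$

which proves (6).
Also

$$
f(\mathbf{x}_{k+1}) + c_k P(\mathbf{x}_{k+1}) \geq f(\mathbf{x}_k) + c_k P(\mathbf{x}_k),
$$

and hence using (6) we obtain (7).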


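Lemma 2. Let $\mathbf{x}^*$ be a solution to problem (1). Then for each $k$

$$
f(\mathbf{x}^*) \geq q(c_k, \mathbf{x}_k) \geq f(\mathbf{x}_k).
$$


Proof.

$$
f(\mathbf{x}^*) = f(\mathbf{x}^*) + c_k P(\mathbf{x}^*) \geq f(\mathbf{x}_k) + c_k P(\mathbf{x}_k) \geq f(\mathbf{x}_k). \qquad \blacksquare
$$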


Global convergence of the penalty method, or more precisely verification that 
any limit point of the sequence is a solution, follows easily from the two lemmas 
above. 


Theorem. Let $\{\mathbf{x}_k\}$ be a sequence generated by the penalty method. Then, any
limit point of the sequence is a solution to (1).


Proof. Suppose the subsequence $\{\mathbf{x}_k\}$, $k \in \mathcal{K}$, is a convergent subsequence of $\{\mathbf{x}_k\}$
having limit $\bar{\mathbf{x}}$. Then by the continuity of $f$, we have

$$
\lim_{k \in \mathcal{K}} f(\mathbf{x}_k) = f(\bar{\mathbf{x}}). \tag{10}
$$

Let $f^*$ be the optimal value associated with problem (1). Then according to
Lemmas 1 and 2, the sequence of values $q(c_k, \mathbf{x}_k)$ is nondecreasing and bounded
above by $f^*$. Thus

$$
\lim_{k \in \mathcal{K}} q(c_k, \mathbf{x}_k) = q^* \leq f^*. \tag{11}
$$

Subtracting (10) from (11) yields

$$
\lim_{k \in \mathcal{K}} c_k P(\mathbf{x}_k) = q^* - f(\bar{\mathbf{x}}). \tag{12}
$$

Since $P(\mathbf{x}_k) \geq 0$ and $c_k \to \infty$, (12) implies

$$
\lim_{k \in \mathcal{K}} P(\mathbf{x}_k) = 0.
$$

Using the continuity of $P$, this implies $P(\bar{\mathbf{x}}) = 0$. We therefore have shown that the
limit point $\bar{\mathbf{x}}$ is feasible for (1).
To show that $\bar{\mathbf{x}}$ is optimal we note that from Lemma 2, $f(\mathbf{x}_k) \leq f^*$ and hence

$$
f(\bar{\mathbf{x}}) = \lim_{k \in \mathcal{K}} f(\mathbf{x}_k) \leq f^*. \qquad \blacksquare
$$


13.2 BARRIER METHODS 


Barrier methods are applicable to problems of the form 


$$
\begin{aligned}
&\text{minimize} && f(\mathbf{x}) \\
&\text{subject to} && \mathbf{x} \in S,
\end{aligned} \tag{13}
$$

where the constraint set $S$ has a nonempty interior that is arbitrarily close to any
point of $S$. Intuitively, what this means is that the set has an interior and it is
possible to get to any boundary point by approaching it from the interior. We shall
refer to such a set as robust. Some examples of robust and nonrobust sets are shown
in Fig. 13.2. This kind of set often arises in conjunction with inequality constraints,
where $S$ takes the form

$$
S = \{\mathbf{x} : g_i(\mathbf{x}) \leq 0,\ i = 1, 2, \ldots, p\}.
$$

Barrier methods are also termed interior methods. They work by establishing
a barrier on the boundary of the feasible region that prevents a search procedure
from leaving the region. A barrier function is a function $B$ defined on the interior
of $S$ such that: (i) $B$ is continuous, (ii) $B(\mathbf{x}) \geq 0$, (iii) $B(\mathbf{x}) \to \infty$ as $\mathbf{x}$ approaches
the boundary of $S$.


Example 1. Let $g_i$, $i = 1, 2, \ldots, p$ be continuous functions on $E^n$. Suppose

$$
S = \{\mathbf{x} : g_i(\mathbf{x}) \leq 0,\ i = 1, 2, \ldots, p\}
$$

is robust, and suppose the interior of $S$ is the set of $\mathbf{x}$'s where $g_i(\mathbf{x}) < 0$, $i =
1, 2, \ldots, p$. Then the function

$$
B(\mathbf{x}) = -\sum_{i=1}^{p} \frac{1}{g_i(\mathbf{x})},
$$

defined on the interior of $S$, is a barrier function. It is illustrated in one dimension
for $g_1 = x - a$, $g_2 = x - b$ in Fig. 13.3.

Fig. 13.2 Examples (left to right: robust, not robust, not robust)


Example 2. For the same situation as Example 1, we may use the logarithmic
utility function

$$
B(\mathbf{x}) = -\sum_{i=1}^{p} \log[-g_i(\mathbf{x})].
$$

This is the barrier function commonly used in linear programming interior point
methods, and it is frequently used more generally as well.


Corresponding to the problem (13), consider the approximate problem

$$
\begin{aligned}
&\text{minimize} && f(\mathbf{x}) + \frac{1}{c} B(\mathbf{x}) \\
&\text{subject to} && \mathbf{x} \in \text{interior of } S,
\end{aligned} \tag{14}
$$

where $c$ is a positive constant.
Alternatively, it is common to formulate the barrier method as

$$
\begin{aligned}
&\text{minimize} && f(\mathbf{x}) + \mu B(\mathbf{x}) \\
&\text{subject to} && \mathbf{x} \in \text{interior of } S.
\end{aligned} \tag{15}
$$

Fig. 13.3 Barrier function $\frac{1}{c} B(x)$

When formulated with $c$ we take $c$ large (going to infinity), while when formulated
with $\mu$ we take $\mu$ small (going to zero). Either way the result is a constrained
problem, and indeed the constraint is somewhat more complicated than in the 
original problem (13). The advantage of this problem, however, is that it can be 
solved by using an unconstrained search technique. To find the solution one starts 
at an initial interior point and then searches from that point using steepest descent 
or some other iterative descent method applicable to unconstrained problems. Since 
the value of the objective function approaches infinity near the boundary of S, the 
search technique (if carefully implemented) will automatically remain within the 
interior of S, and the constraint need not be accounted for explicitly. Thus, although 
problem (14) or (15) is from a formal viewpoint a constrained problem, from a 
computational viewpoint it is unconstrained. 


The Method 

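The barrier method is quite analogous to the penalty method. Let $\{c_k\}$ be a sequence
tending to infinity such that for each $k$, $k = 1, 2, \ldots$, $c_k \geq 0$, $c_{k+1} > c_k$. Define the
function

$$
r(c, \mathbf{x}) = f(\mathbf{x}) + \frac{1}{c} B(\mathbf{x}).
$$

For each $k$ solve the problem

$$
\begin{aligned}
&\text{minimize} && r(c_k, \mathbf{x}) \\
&\text{subject to} && \mathbf{x} \in \text{interior of } S,
\end{aligned}
$$

obtaining the point $\mathbf{x}_k$.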
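As a minimal sketch of this method, assume the hypothetical one-variable problem of minimizing $x^2$ subject to $g(x) = 1 - x \leq 0$, with the inverse barrier $B(x) = -1/g(x) = 1/(x - 1)$. On the interior $x > 1$ the barrier objective is convex, so its minimizer is the root of the increasing derivative $r'(x) = 2x - 1/(c(x-1)^2)$. The bisection routine, its bracket, and the function names are illustrative assumptions.

```python
def barrier_minimizer(c, lo=1.0 + 1e-12, hi=3.0, iters=200):
    """Minimize r(c, x) = x**2 + (1/c) * 1/(x - 1) over x > 1.

    Bisection is applied to the derivative r'(x), which is increasing on
    x > 1, so every trial point stays in the interior of the feasible set.
    """
    def dr(x):
        return 2.0 * x - 1.0 / (c * (x - 1.0) ** 2)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dr(mid) < 0.0:
            lo = mid        # minimizer is to the right of mid
        else:
            hi = mid        # minimizer is at or to the left of mid
    return 0.5 * (lo + hi)

def barrier_method(cs):
    """Return (c, x_k, multiplier estimate) for each c in cs."""
    rows = []
    for c in cs:
        x_k = barrier_minimizer(c)
        lam_k = 1.0 / (c * (1.0 - x_k) ** 2)   # 1 / (c * g(x_k)**2)
        rows.append((c, x_k, lam_k))
    return rows

if __name__ == "__main__":
    for c, x_k, lam_k in barrier_method([1.0, 10.0, 100.0, 1e4, 1e6]):
        print(f"c={c:9.1f}  x_k={x_k:.8f}  multiplier estimate={lam_k:.6f}")
```

In contrast to the penalty method, the points $x_k$ here approach the solution $x^* = 1$ from inside the feasible region, and the quantity $1/(c\,g(x_k)^2)$ tends to the multiplier $\lambda^* = 2$ of this toy problem.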


Convergence 


Virtually the same convergence properties hold for the barrier method as for the 
penalty method. We leave to the reader the proof of the following result. 


Theorem. Any limit point of a sequence $\{\mathbf{x}_k\}$ generated by the barrier method
is a solution to problem (13).


13.3. PROPERTIES OF PENALTY AND BARRIER 
FUNCTIONS 


Penalty and barrier methods are applicable to nonlinear programming problems 
having a very general form of constraint set S. In most situations, however, this set 
is not given explicitly but is defined implicitly by a number of functional constraints. 
In these situations, the penalty or barrier function is invariably defined in terms of
the constraint functions themselves; and although there are an unlimited number of
ways in which this can be done, some important general implications follow from 
this kind of construction. 

For economy of notation we consider problems of the form 


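$$
\begin{aligned}
&\text{minimize} && f(\mathbf{x}) \\
&\text{subject to} && g_i(\mathbf{x}) \leq 0, \quad i = 1, 2, \ldots, p.
\end{aligned} \tag{16}
$$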


For our present purposes, equality constraints are suppressed, at least notationally, 
by writing each of them as two inequalities. If the problem is to be attacked with 
a barrier method, then, of course, equality constraints are not present even in an 
unsuppressed version. 


Penalty Functions 


A penalty function for a problem expressed in the form (16) will most naturally be 
expressed in terms of the auxiliary constraint functions 


$$
g_i^+(\mathbf{x}) = \max[0, g_i(\mathbf{x})], \quad i = 1, 2, \ldots, p. \tag{17}
$$

This is because in the interior of the constraint region $P(\mathbf{x}) = 0$ and hence $P$ should
be a function only of violated constraints. Denoting by $\mathbf{g}^+(\mathbf{x})$ the $p$-dimensional
vector made up of the $g_i^+(\mathbf{x})$'s, we consider the general class of penalty functions

$$
P(\mathbf{x}) = \gamma(\mathbf{g}^+(\mathbf{x})), \tag{18}
$$

where $\gamma$ is a continuous function from $E^p$ to the real numbers, defined in such a
way that $P$ satisfies the requirements demanded of a penalty function.


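Example 1. Set

$$
P(\mathbf{x}) = \frac{1}{2} \sum_{i=1}^{p} \left(g_i^+(\mathbf{x})\right)^2 = \frac{1}{2} |\mathbf{g}^+(\mathbf{x})|^2,
$$

which is without doubt the most popular penalty function. In this case $\gamma$ is one-half
times the identity quadratic form on $E^p$, that is, $\gamma(\mathbf{y}) = \frac{1}{2}|\mathbf{y}|^2$.


Example 2. By letting

$$
\gamma(\mathbf{y}) = \mathbf{y}^T \boldsymbol{\Gamma} \mathbf{y},
$$

where $\boldsymbol{\Gamma}$ is a symmetric positive definite $p \times p$ matrix, we obtain the penalty function

$$
P(\mathbf{x}) = \mathbf{g}^+(\mathbf{x})^T \boldsymbol{\Gamma} \mathbf{g}^+(\mathbf{x}).
$$

Example 3. A general class of penalty functions is

$$
P(\mathbf{x}) = \sum_{i=1}^{p} \left(g_i^+(\mathbf{x})\right)^{\varepsilon}
$$

for some $\varepsilon > 0$.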


Lagrange Multipliers 
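In the penalty method we solve, for various $c_k$, the unconstrained problem

$$
\text{minimize} \quad f(\mathbf{x}) + c_k P(\mathbf{x}). \tag{19}
$$

Most algorithms require that the objective function has continuous first partial
derivatives. Since we shall, as usual, assume that both $f$ and $\mathbf{g} \in C^1$, it is natural to
require, then, that the penalty function $P \in C^1$. We define

$$
\nabla g_i^+(\mathbf{x}) =
\begin{cases}
\nabla g_i(\mathbf{x}) & \text{if } g_i(\mathbf{x}) \geq 0 \\
\mathbf{0} & \text{if } g_i(\mathbf{x}) < 0,
\end{cases} \tag{20}
$$

and, of course, $\nabla \mathbf{g}^+(\mathbf{x})$ is the $p \times n$ matrix whose rows are the $\nabla g_i^+$'s. Unfortunately,
$\nabla \mathbf{g}^+$ is usually discontinuous at points where $g_i(\mathbf{x}) = 0$ for some $i = 1, 2, \ldots, p$,
and thus some restrictions must be placed on $\gamma$ in order to guarantee $P \in C^1$. We
assume that $\gamma \in C^1$ and that if $\mathbf{y} = (y_1, y_2, \ldots, y_p)$, $\nabla \gamma(\mathbf{y}) = (\nabla \gamma_1, \nabla \gamma_2, \ldots, \nabla \gamma_p)$,
then

$$
y_i = 0 \quad \text{implies} \quad \nabla \gamma_i = 0. \tag{21}
$$

(In Example 3 above, for instance, this condition is satisfied only for $\varepsilon > 1$.) With
this assumption, the derivative of $\gamma(\mathbf{g}^+(\mathbf{x}))$ with respect to $\mathbf{x}$ is continuous and
can be written as $\nabla \gamma(\mathbf{g}^+(\mathbf{x})) \nabla \mathbf{g}(\mathbf{x})$. In this result $\nabla \mathbf{g}(\mathbf{x})$ legitimately replaces the
discontinuous $\nabla \mathbf{g}^+(\mathbf{x})$, because it is premultiplied by $\nabla \gamma(\mathbf{g}^+(\mathbf{x}))$. Of course, these
considerations are necessary only for inequality constraints. If equality constraints
are treated directly, the situation is far simpler.

In view of this assumption, problem (19) will have its solution at a point $\mathbf{x}_k$
satisfying

$$
\nabla f(\mathbf{x}_k) + c_k \nabla \gamma(\mathbf{g}^+(\mathbf{x}_k)) \nabla \mathbf{g}(\mathbf{x}_k) = \mathbf{0},
$$

which can be written as

$$
\nabla f(\mathbf{x}_k) + \boldsymbol{\lambda}_k^T \nabla \mathbf{g}(\mathbf{x}_k) = \mathbf{0}, \tag{22}
$$

where

$$
\boldsymbol{\lambda}_k^T = c_k \nabla \gamma(\mathbf{g}^+(\mathbf{x}_k)). \tag{23}
$$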


Thus, associated with every $c_k$ is a Lagrange multiplier vector that is determined
after the unconstrained minimization is performed.

If a solution $\mathbf{x}^*$ to the original problem (16) is a regular point of the constraints,
then there is a unique Lagrange multiplier vector $\boldsymbol{\lambda}^*$ associated with the solution.
The result stated below says that $\boldsymbol{\lambda}_k \to \boldsymbol{\lambda}^*$.


Proposition. Suppose that the penalty function method is applied to problem
(16) using a penalty function of the form (18) with $\gamma \in C^1$ and satisfying
(21). Corresponding to the sequence $\{\mathbf{x}_k\}$ generated by this method, define
$\boldsymbol{\lambda}_k^T = c_k \nabla \gamma(\mathbf{g}^+(\mathbf{x}_k))$. If $\mathbf{x}_k \to \mathbf{x}^*$, a solution to (16), and this solution is a regular
point, then $\boldsymbol{\lambda}_k \to \boldsymbol{\lambda}^*$, the Lagrange multiplier associated with problem (16).


Proof. Left to the reader.


Example 4. For $P(\mathbf{x}) = \frac{1}{2}|\mathbf{g}^+(\mathbf{x})|^2$ we have $\boldsymbol{\lambda}_k = c_k \mathbf{g}^+(\mathbf{x}_k)$.


As a final observation we note that in general if $\mathbf{x}_k \to \mathbf{x}^*$, then since $\boldsymbol{\lambda}_k =
c_k \nabla \gamma(\mathbf{g}^+(\mathbf{x}_k))^T \to \boldsymbol{\lambda}^*$, the sequence $\mathbf{x}_k$ approaches $\mathbf{x}^*$ from outside the constraint
region. Indeed, as $\mathbf{x}_k$ approaches $\mathbf{x}^*$ all constraints that are active at $\mathbf{x}^*$ and have
positive Lagrange multipliers will be violated at $\mathbf{x}_k$ because the corresponding
components of $\nabla \gamma(\mathbf{g}^+(\mathbf{x}_k))$ are positive. Thus, if we assume that the active
constraints are nondegenerate (all Lagrange multipliers are strictly positive), every
active constraint will be approached from the outside.
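As a small worked illustration (a hypothetical one-variable problem, not taken from the text), consider minimizing $x^2$ subject to $g(x) = 1 - x \leq 0$ with the quadratic penalty of Example 4. Since $x^2$ alone is minimized at an infeasible point, the penalty minimizer lies in the region $x < 1$, where the calculation can be done in closed form:

```latex
\begin{aligned}
q(c, x) &= x^2 + \tfrac{1}{2}\, c\, (1 - x)^2 \qquad (x < 1), \\
\frac{\partial q}{\partial x}(c, x_c) = 0
  \;&\Longrightarrow\; 2 x_c - c\,(1 - x_c) = 0
  \;\Longrightarrow\; x_c = \frac{c}{2 + c} < 1, \\
\lambda_c &= c\, g^+(x_c) = c\,(1 - x_c) = \frac{2c}{2 + c}
  \;\longrightarrow\; 2 \quad \text{as } c \to \infty .
\end{aligned}
```

The limit agrees with the multiplier of the constrained problem, since at $x^* = 1$ the condition $\nabla f(x^*) + \lambda^* \nabla g(x^*) = 2 - \lambda^* = 0$ gives $\lambda^* = 2$. Every $x_c$ violates the constraint, which is the approach from outside described above.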


The Hessian Matrix 


Since the penalty function method must, for various (large) values of c, solve the 
unconstrained problem 


minimize f(x)+cP(x), (24) 


it is important, in order to evaluate the difficulty of such a problem, to determine 
the eigenvalue structure of the Hessian of this modified objective function. We 
show here that the structure becomes increasingly unfavorable as c increases. 

Although in this section we require that the function $P \in C^1$, we do not require
that $P \in C^2$. In particular, the most popular penalty function $P(\mathbf{x}) = \frac{1}{2}|\mathbf{g}^+(\mathbf{x})|^2$,
illustrated in Fig. 13.1 for one component, has a discontinuity in its second derivative
at any point where a component of $\mathbf{g}$ is zero. At first this might appear to be a
serious drawback, since it means the Hessian is discontinuous at the boundary of the
constraint region, right where, in general, the solution is expected to lie. However,
as pointed out above, the penalty method generates points that approach a boundary
solution from outside the constraint region. Thus, except for some possible chance
occurrences, the sequence will, as $\mathbf{x}_k \to \mathbf{x}^*$, be at points where the Hessian is well-defined.
Furthermore, in iteratively solving the unconstrained problem (24) with
a fixed $c_k$, a sequence will be generated that converges to $\mathbf{x}_k$, which is (for most
values of $k$) a point where the Hessian is well-defined, and hence the standard type
of analysis will be applicable to the tail of such a sequence.

Defining $q(c, \mathbf{x}) = f(\mathbf{x}) + c\gamma(\mathbf{g}^+(\mathbf{x}))$ we have for the Hessian, $\mathbf{Q}$, of $q$ (with
respect to $\mathbf{x}$)

$$
\mathbf{Q}(c, \mathbf{x}) = \mathbf{F}(\mathbf{x}) + c \nabla \gamma(\mathbf{g}^+(\mathbf{x})) \mathbf{G}(\mathbf{x}) + c \nabla \mathbf{g}^+(\mathbf{x})^T \boldsymbol{\Gamma}(\mathbf{g}^+(\mathbf{x})) \nabla \mathbf{g}^+(\mathbf{x}),
$$

where $\mathbf{F}$, $\mathbf{G}$, and $\boldsymbol{\Gamma}$ are, respectively, the Hessians of $f$, $\mathbf{g}$, and $\gamma$. For a fixed $c_k$ we
use the definition of $\boldsymbol{\lambda}_k$ given by (23) and introduce the rather natural definition

$$
\mathbf{L}_k(\mathbf{x}_k) = \mathbf{F}(\mathbf{x}_k) + \boldsymbol{\lambda}_k^T \mathbf{G}(\mathbf{x}_k), \tag{25}
$$

which is the Hessian of the corresponding Lagrangian. Then we have

$$
\mathbf{Q}(c_k, \mathbf{x}_k) = \mathbf{L}_k(\mathbf{x}_k) + c_k \nabla \mathbf{g}^+(\mathbf{x}_k)^T \boldsymbol{\Gamma}(\mathbf{g}^+(\mathbf{x}_k)) \nabla \mathbf{g}^+(\mathbf{x}_k), \tag{26}
$$

which is the desired expression.

The first term on the right side of (26) converges to the Hessian of the
Lagrangian of the original constrained problem as $\mathbf{x}_k \to \mathbf{x}^*$, and hence has a limit
that is independent of $c_k$. The second term is a matrix having rank equal to the
rank of the active constraints and having a magnitude tending to infinity. (See
Exercise 7.)


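Example 5. For $P(\mathbf{x}) = \frac{1}{2}|\mathbf{g}^+(\mathbf{x})|^2$ we have

$$
\boldsymbol{\Gamma}(\mathbf{g}^+(\mathbf{x}_k)) =
\begin{bmatrix}
e_1 & 0 & \cdots & 0 \\
0 & e_2 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & e_p
\end{bmatrix},
$$

where

$$
e_i =
\begin{cases}
1 & \text{if } g_i(\mathbf{x}_k) > 0 \\
0 & \text{if } g_i(\mathbf{x}_k) < 0 \\
\text{undefined} & \text{if } g_i(\mathbf{x}_k) = 0.
\end{cases}
$$

Thus

$$
c_k \nabla \mathbf{g}^+(\mathbf{x}_k)^T \boldsymbol{\Gamma}(\mathbf{g}^+(\mathbf{x}_k)) \nabla \mathbf{g}^+(\mathbf{x}_k) = c_k \nabla \mathbf{g}^+(\mathbf{x}_k)^T \nabla \mathbf{g}^+(\mathbf{x}_k),
$$

which is $c_k$ times a matrix that approaches $\nabla \mathbf{g}^+(\mathbf{x}^*)^T \nabla \mathbf{g}^+(\mathbf{x}^*)$. This matrix has rank
equal to the rank of the active constraints at $\mathbf{x}^*$ (refer to (20)).


Assuming that there are $r$ active constraints at the solution $\mathbf{x}^*$, then for well-behaved
$\gamma$, the Hessian matrix $\mathbf{Q}(c_k, \mathbf{x}_k)$ has $r$ eigenvalues that tend to infinity as
$c_k \to \infty$, arising from the second term on the right side of (26). There will be $n - r$
other eigenvalues that, although varying with $c_k$, tend to finite limits. These limits
turn out to be, as is perhaps not too surprising at this point, the eigenvalues of
$\mathbf{L}(\mathbf{x}^*)$ restricted to the tangent subspace $M$ of the active constraints. The proof of
this requires some further analysis.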


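Lemma 1. Let $\mathbf{A}(c)$ be a symmetric matrix written in partitioned form

$$
\mathbf{A}(c) =
\begin{bmatrix}
\mathbf{A}_1(c) & \mathbf{A}_2(c) \\
\mathbf{A}_2^T(c) & \mathbf{A}_3(c)
\end{bmatrix}, \tag{27}
$$

where $\mathbf{A}_1(c)$ tends to a positive definite matrix $\mathbf{A}_1$, $\mathbf{A}_2(c)$ tends to a finite
matrix, and $\mathbf{A}_3(c)$ is a positive definite matrix tending to infinity with $c$ (that
is, for any $s > 0$, $\mathbf{A}_3(c) - s\mathbf{I}$ is positive definite for sufficiently large $c$). Then

$$
\mathbf{A}^{-1}(c) \to
\begin{bmatrix}
\mathbf{A}_1^{-1} & \mathbf{0} \\
\mathbf{0} & \mathbf{0}
\end{bmatrix} \tag{28}
$$

as $c \to \infty$.

Proof. We have the identity

$$
\begin{bmatrix}
\mathbf{A}_1 & \mathbf{A}_2 \\
\mathbf{A}_2^T & \mathbf{A}_3
\end{bmatrix}^{-1}
=
\begin{bmatrix}
(\mathbf{A}_1 - \mathbf{A}_2 \mathbf{A}_3^{-1} \mathbf{A}_2^T)^{-1} & -(\mathbf{A}_1 - \mathbf{A}_2 \mathbf{A}_3^{-1} \mathbf{A}_2^T)^{-1} \mathbf{A}_2 \mathbf{A}_3^{-1} \\
-\mathbf{A}_3^{-1} \mathbf{A}_2^T (\mathbf{A}_1 - \mathbf{A}_2 \mathbf{A}_3^{-1} \mathbf{A}_2^T)^{-1} & (\mathbf{A}_3 - \mathbf{A}_2^T \mathbf{A}_1^{-1} \mathbf{A}_2)^{-1}
\end{bmatrix}. \tag{29}
$$

Using the fact that $\mathbf{A}_3^{-1}(c) \to \mathbf{0}$ gives the result.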
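The lemma is easy to check by hand in the smallest possible case. As an illustrative assumption, take scalar blocks $a_1 > 0$ and $a_2$ that do not depend on $c$, and let the lower-right block be $c$ itself:

```latex
\mathbf{A}(c) =
\begin{bmatrix} a_1 & a_2 \\ a_2 & c \end{bmatrix},
\qquad
\mathbf{A}^{-1}(c) = \frac{1}{a_1 c - a_2^2}
\begin{bmatrix} c & -a_2 \\ -a_2 & a_1 \end{bmatrix}
\;\longrightarrow\;
\begin{bmatrix} 1/a_1 & 0 \\ 0 & 0 \end{bmatrix}
\quad \text{as } c \to \infty .
```

The upper-left entry $c/(a_1 c - a_2^2)$ tends to $1/a_1$ while the other three entries vanish like $1/c$, which is the limit asserted in (28).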


To apply this result to the Hessian matrix (26) we associate $\mathbf{A}$ with $\mathbf{Q}(c_k, \mathbf{x}_k)$
and let the partition of $\mathbf{A}$ correspond to the partition of the space $E^n$ into the subspace
$M$ and the subspace $N$ that is orthogonal to $M$; that is, $N$ is the subspace spanned
by the gradients of the active constraints. In this partition, $\mathbf{L}_M$, the restriction of $\mathbf{L}$
to $M$, corresponds to the matrix $\mathbf{A}_1$.

We leave the details of the required continuity arguments to the reader. The
important conclusion is that if $\mathbf{x}^*$ is a solution to (16), is a regular point, and has
exactly $r$ active constraints none of which are degenerate, then the Hessian matrices
$\mathbf{Q}(c_k, \mathbf{x}_k)$ of a penalty function of form (18) have $r$ eigenvalues tending to infinity
as $c_k \to \infty$, and $n - r$ eigenvalues tending to the eigenvalues of $\mathbf{L}_M$.

This explicit characterization of the structure of penalty function Hessians is 
of great importance in the remainder of the chapter. The fundamental point is that 
virtually any choice of penalty function (within the class considered) leads both to 
an ill-conditioned Hessian and to consideration of the ubiquitous Hessian of the 
Lagrangian restricted to M. 
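The eigenvalue split can be seen numerically on a very small case. The sketch below assumes a hypothetical problem with $f(\mathbf{x}) = \frac{1}{2}(x_1^2 + 3x_2^2)$ and a single linear constraint $x_1 + x_2 = 1$ handled by the quadratic penalty, so that the penalty Hessian is $\mathbf{F} + c\,\mathbf{a}\mathbf{a}^T$ with $\mathbf{F} = \mathrm{diag}(1, 3)$ and $\mathbf{a} = (1, 1)$. The function name is illustrative. Because the constraint is linear, the Hessian of the Lagrangian is just $\mathbf{F}$, and its restriction to the tangent subspace $\{\mathbf{d} : d_1 + d_2 = 0\}$ has the single eigenvalue $(1 + 3)/2 = 2$.

```python
import math

def penalty_hessian_eigs(c):
    """Eigenvalues of Q = F + c * a a^T for F = diag(1, 3), a = (1, 1).

    Uses the closed form for a symmetric 2 x 2 matrix [[q11, q12], [q12, q22]].
    """
    q11, q12, q22 = 1.0 + c, c, 3.0 + c
    mean = 0.5 * (q11 + q22)
    radius = math.hypot(0.5 * (q11 - q22), q12)
    return mean - radius, mean + radius

if __name__ == "__main__":
    for c in [1.0, 10.0, 100.0, 1e4]:
        small, large = penalty_hessian_eigs(c)
        print(f"c={c:8.1f}  small={small:.6f}  large={large:.2f}  "
              f"condition={large / small:.1f}")
```

One eigenvalue grows like $2c$ while the other settles at 2, so the condition number of the penalty Hessian grows roughly in proportion to $c$.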


Barrier Functions 


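Essentially the same story holds for barrier functions. If we consider for problem
(16) barrier functions of the form

$$
B(\mathbf{x}) = \eta(\mathbf{g}(\mathbf{x})), \tag{30}
$$

then Lagrange multipliers and ill-conditioned Hessians are again inevitable. Rather
than parallel the earlier analysis of penalty functions, we illustrate the conclusions
with two examples.


Example 1. Define

$$
B(\mathbf{x}) = \sum_{i=1}^{p} -\frac{1}{g_i(\mathbf{x})}. \tag{31}
$$

The barrier objective

$$
r(c_k, \mathbf{x}) = f(\mathbf{x}) - \frac{1}{c_k} \sum_{i=1}^{p} \frac{1}{g_i(\mathbf{x})}
$$

has its minimum at a point $\mathbf{x}_k$ satisfying

$$
\nabla f(\mathbf{x}_k) + \frac{1}{c_k} \sum_{i=1}^{p} \frac{1}{g_i(\mathbf{x}_k)^2} \nabla g_i(\mathbf{x}_k) = \mathbf{0}. \tag{32}
$$

Thus, we define $\boldsymbol{\lambda}_k$ to be the vector having $i$th component $\lambda_{k,i} = \dfrac{1}{c_k\, g_i(\mathbf{x}_k)^2}$. Then (32)
can be written as

$$
\nabla f(\mathbf{x}_k) + \boldsymbol{\lambda}_k^T \nabla \mathbf{g}(\mathbf{x}_k) = \mathbf{0}.
$$

Again, assuming $\mathbf{x}_k \to \mathbf{x}^*$, the solution of (16), we can show that $\boldsymbol{\lambda}_k \to \boldsymbol{\lambda}^*$, the
Lagrange multiplier vector associated with the solution. This implies that if $g_i$ is an
active constraint,

$$
\frac{1}{c_k\, g_i(\mathbf{x}_k)^2} \to \lambda_i^*. \tag{33}
$$

Next, evaluating the Hessian $\mathbf{R}(c_k, \mathbf{x}_k)$ of $r(c_k, \mathbf{x}_k)$, we have

$$
\begin{aligned}
\mathbf{R}(c_k, \mathbf{x}_k) &= \mathbf{F}(\mathbf{x}_k) + \frac{1}{c_k} \sum_{i=1}^{p} \frac{1}{g_i(\mathbf{x}_k)^2} \mathbf{G}_i(\mathbf{x}_k) - \frac{2}{c_k} \sum_{i=1}^{p} \frac{1}{g_i(\mathbf{x}_k)^3} \nabla g_i(\mathbf{x}_k)^T \nabla g_i(\mathbf{x}_k) \\
&= \mathbf{L}(\mathbf{x}_k) - \frac{2}{c_k} \sum_{i=1}^{p} \frac{1}{g_i(\mathbf{x}_k)^3} \nabla g_i(\mathbf{x}_k)^T \nabla g_i(\mathbf{x}_k).
\end{aligned}
$$

As $c_k \to \infty$ we have

$$
\frac{-1}{c_k\, g_i(\mathbf{x}_k)^3} \to
\begin{cases}
\infty & \text{if } g_i \text{ is active at } \mathbf{x}^* \\
0 & \text{if } g_i \text{ is inactive at } \mathbf{x}^*,
\end{cases}
$$

so that we may write, from (33),

$$
\mathbf{R}(c_k, \mathbf{x}_k) \to \mathbf{L}(\mathbf{x}_k) + \sum_{i \in I} -\frac{2\lambda_{k,i}}{g_i(\mathbf{x}_k)} \nabla g_i(\mathbf{x}_k)^T \nabla g_i(\mathbf{x}_k),
$$

where $I$ is the set of indices corresponding to active constraints. Thus the Hessian
of the barrier objective function has exactly the same structure as that of penalty
objective functions.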


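Example 2. Let us use the logarithmic barrier function

$$
B(\mathbf{x}) = -\sum_{i=1}^{p} \log[-g_i(\mathbf{x})].
$$

In this case we will define the barrier objective in terms of $\mu$ as

$$
r(\mu, \mathbf{x}) = f(\mathbf{x}) - \mu \sum_{i=1}^{p} \log[-g_i(\mathbf{x})].
$$

The minimum point $\mathbf{x}_\mu$ satisfies

$$
\mathbf{0} = \nabla f(\mathbf{x}_\mu) + \mu \sum_{i=1}^{p} \frac{-1}{g_i(\mathbf{x}_\mu)} \nabla g_i(\mathbf{x}_\mu). \tag{34}
$$

Defining

$$
\lambda_{\mu,i} = \mu\, \frac{-1}{g_i(\mathbf{x}_\mu)},
$$

(34) can be written as

$$
\nabla f(\mathbf{x}_\mu) + \boldsymbol{\lambda}_\mu^T \nabla \mathbf{g}(\mathbf{x}_\mu) = \mathbf{0}.
$$

Further we expect that $\boldsymbol{\lambda}_\mu \to \boldsymbol{\lambda}^*$ as $\mu \to 0$.


The Hessian of $r(\mu, \mathbf{x})$ is

$$
\mathbf{R}(\mu, \mathbf{x}_\mu) = \mathbf{F}(\mathbf{x}_\mu) + \sum_{i=1}^{p} \lambda_{\mu,i} \mathbf{G}_i(\mathbf{x}_\mu) + \sum_{i=1}^{p} -\frac{\lambda_{\mu,i}}{g_i(\mathbf{x}_\mu)} \nabla g_i(\mathbf{x}_\mu)^T \nabla g_i(\mathbf{x}_\mu).
$$

Hence, for small $\mu$ it has the same structure as that found in Example 1.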


The Central Path 


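The definition of the central path associated with linear programs is easily extended
to general nonlinear programs. For example, consider the problem

$$
\begin{aligned}
&\text{minimize} && f(\mathbf{x}) \\
&\text{subject to} && \mathbf{h}(\mathbf{x}) = \mathbf{0} \\
& && \mathbf{g}(\mathbf{x}) \leq \mathbf{0}.
\end{aligned}
$$

We assume that $\mathcal{F} = \{\mathbf{x} : \mathbf{h}(\mathbf{x}) = \mathbf{0},\ \mathbf{g}(\mathbf{x}) < \mathbf{0}\} \neq \emptyset$. Then we use the logarithmic
barrier function to define the problems

$$
\begin{aligned}
&\text{minimize} && f(\mathbf{x}) - \mu \sum_{i=1}^{p} \log[-g_i(\mathbf{x})] \\
&\text{subject to} && \mathbf{h}(\mathbf{x}) = \mathbf{0}.
\end{aligned}
$$

The solution $\mathbf{x}_\mu$ parameterized by $\mu \to 0$ is the central path.
The necessary conditions for the problem can be written as

$$
\begin{aligned}
\nabla f(\mathbf{x}_\mu) + \boldsymbol{\lambda}^T \nabla \mathbf{g}(\mathbf{x}_\mu) + \mathbf{y}^T \nabla \mathbf{h}(\mathbf{x}_\mu) &= \mathbf{0} \\
\mathbf{h}(\mathbf{x}_\mu) &= \mathbf{0} \\
\lambda_i\, g_i(\mathbf{x}_\mu) &= -\mu, \quad i = 1, 2, \ldots, p,
\end{aligned}
$$

where $\mathbf{y}$ is the Lagrange multiplier vector for the constraint $\mathbf{h}(\mathbf{x}_\mu) = \mathbf{0}$.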
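A central path can be traced explicitly for a very small problem. The sketch below assumes a hypothetical one-variable example with no equality constraints: minimize $(x + 1)^2$ subject to $g(x) = -x \leq 0$, whose solution is $x^* = 0$ with multiplier $\lambda^* = 2$. Setting the derivative of $(x+1)^2 - \mu \log x$ to zero gives $2x^2 + 2x - \mu = 0$, so the path has a closed form. The function name `central_path_point` is an illustrative choice.

```python
import math

def central_path_point(mu):
    """Return (x_mu, lambda_mu) for: minimize (x + 1)**2 - mu * log(x), x > 0.

    x_mu is the positive root of 2x^2 + 2x - mu = 0, written in a form that
    avoids cancellation for small mu; lambda_mu = -mu / g(x_mu) = mu / x_mu.
    """
    s = math.sqrt(1.0 + 2.0 * mu)
    x_mu = mu / (1.0 + s)
    lam_mu = mu / x_mu
    return x_mu, lam_mu

if __name__ == "__main__":
    for mu in [1.0, 0.1, 1e-3, 1e-6]:
        x_mu, lam_mu = central_path_point(mu)
        print(f"mu={mu:8.1e}  x_mu={x_mu:.8f}  lambda_mu={lam_mu:.8f}")
```

Each point satisfies the perturbed complementarity condition $\lambda_\mu\, g(x_\mu) = -\mu$ by construction, and as $\mu \to 0$ the pair $(x_\mu, \lambda_\mu)$ tends to $(0, 2)$.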


Geometric Interpretation—The Primal Function 


There is a geometric construction that provides a simple interpretation of penalty 
functions. The basis of the construction itself is also useful in other areas of 
optimization, especially duality theory, as explained in the next chapter. 

Let us again consider the problem

$$
\begin{aligned}
&\text{minimize} && f(\mathbf{x}) \\
&\text{subject to} && \mathbf{h}(\mathbf{x}) = \mathbf{0},
\end{aligned} \tag{35}
$$

where $\mathbf{h}(\mathbf{x}) \in E^m$. We assume that the solution point $\mathbf{x}^*$ of (35) is a regular point
and that the second-order sufficiency conditions are satisfied. Corresponding to this
problem we introduce the following definition:


Definition. Corresponding to the constrained minimization problem (35), the
primal function $\omega$ is defined on $E^m$ in a neighborhood of $\mathbf{0}$ to be

$$
\omega(\mathbf{y}) = \min\{f(\mathbf{x}) : \mathbf{h}(\mathbf{x}) = \mathbf{y}\}. \tag{36}
$$

The primal function gives the optimal value of the objective for various values of
the right-hand side. In particular $\omega(\mathbf{0})$ gives the value of the original problem.
Strictly speaking the minimum in the definition (36) must be specified as a local
minimum, in a neighborhood of $\mathbf{x}^*$. The existence of $\omega(\mathbf{y})$ then follows directly
from the Sensitivity Theorem in Section 11.7. Furthermore, from that theorem it
follows that $\nabla \omega(\mathbf{0}) = -\boldsymbol{\lambda}^{*T}$.
Now consider the penalty problem and note the following relations:

$$
\begin{aligned}
\min_{\mathbf{x}} \{f(\mathbf{x}) + \tfrac{1}{2} c |\mathbf{h}(\mathbf{x})|^2\} &= \min_{\mathbf{x}, \mathbf{y}} \{f(\mathbf{x}) + \tfrac{1}{2} c |\mathbf{y}|^2 : \mathbf{h}(\mathbf{x}) = \mathbf{y}\} \\
&= \min_{\mathbf{y}} \{\omega(\mathbf{y}) + \tfrac{1}{2} c |\mathbf{y}|^2\}.
\end{aligned} \tag{37}
$$

Fig. 13.4 The primal function


This is illustrated in Fig. 13.4 for the case where $y$ is one-dimensional. The primal
function is the lowest curve in the figure. Its value at $y = 0$ is the value of the
original constrained problem. Above the primal function are the curves $\omega(y) + \frac{1}{2} c y^2$
for various values of $c$. The value of the penalty problem is shown by (37) to be
the minimum point of this curve. For large values of $c$ this curve becomes convex
near 0 even if $\omega(y)$ is not convex. Viewed in this way, the penalty functions can
be thought of as convexifying the primal.

Also, as $c$ increases, the associated minimum point moves toward 0. However,
it is never zero for finite $c$. Furthermore, in general, the criterion for $\mathbf{y}$ to be optimal
for the penalty problem is that the gradient of $\omega(\mathbf{y}) + \frac{1}{2} c |\mathbf{y}|^2$ equals zero. This yields
$\nabla \omega(\mathbf{y}) + c \mathbf{y}^T = \mathbf{0}$. Using $\nabla \omega(\mathbf{y}) = -\boldsymbol{\lambda}^T$ and $\mathbf{y} = \mathbf{h}(\mathbf{x}_c)$, where now $\mathbf{x}_c$ denotes the
minimum point of the penalty problem, gives $\boldsymbol{\lambda} = c\,\mathbf{h}(\mathbf{x}_c)$, which is the same as (23).
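These relations can be verified directly on a one-variable illustration (a hypothetical problem chosen only because everything is available in closed form): minimize $x^2$ subject to $h(x) = x - 1 = 0$. Here the constraint $h(x) = y$ forces $x = 1 + y$, so the primal function is simply $f$ evaluated there.

```latex
\begin{aligned}
\omega(y) &= \min\{x^2 : x - 1 = y\} = (1 + y)^2,
  \qquad \nabla\omega(0) = 2 = -\lambda^* \;\Longrightarrow\; \lambda^* = -2, \\
\frac{d}{dy}\Bigl[\omega(y) + \tfrac{1}{2} c\, y^2\Bigr] = 0
  \;&\Longrightarrow\; 2(1 + y_c) + c\, y_c = 0
  \;\Longrightarrow\; y_c = -\frac{2}{2 + c} \neq 0, \\
\lambda_c &= c\, h(x_c) = c\, y_c = -\frac{2c}{2 + c}
  \;\longrightarrow\; -2 = \lambda^* \quad \text{as } c \to \infty .
\end{aligned}
```

The value $\lambda^* = -2$ agrees with the stationarity condition $2x^* + \lambda^* = 0$ at $x^* = 1$. The minimizing $y_c$ moves toward 0 as $c$ increases but is never zero for finite $c$, as stated above.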


13.4 NEWTON’S METHOD AND PENALTY 
FUNCTIONS 


In the next few sections we address the problem of efficiently solving the uncon- 
strained problems associated with a penalty or barrier method. The main difficulty 
is the extremely unfavorable eigenvalue structure that, as explained in Section 13.3, 
always accompanies unconstrained problems derived in this way. Certainly straight- 
forward application of the method of steepest descent is out of the question! 

One method for avoiding slow convergence for these problems is to apply 
Newton’s method (or one of its variations), since the order two convergence of 
Newton’s method is unaffected by the poor eigenvalue structure. In applying the 
method, however, special care must be devoted to the manner by which the Hessian 
is inverted, since it is ill-conditioned. Nevertheless, if second-order information 
is easily available, Newton’s method offers an extremely attractive and effective
method for solving modest size penalty or barrier optimization problems. When
such information is not readily available, or if data handling and storage require- 
ments of Newton’s method are excessive, attention naturally focuses on first-order 
methods. 

A simple modified Newton’s method can often be quite effective for some 
penalty problems. For example, consider the problem having only equality 
constraints 


$$
\begin{aligned}
&\text{minimize} && f(\mathbf{x}) \\
&\text{subject to} && \mathbf{h}(\mathbf{x}) = \mathbf{0}
\end{aligned} \tag{38}
$$

with $\mathbf{x} \in E^n$, $\mathbf{h}(\mathbf{x}) \in E^m$, $m < n$. Applying the standard quadratic penalty method
we solve instead the unconstrained problem

$$
\text{minimize} \quad f(\mathbf{x}) + \tfrac{1}{2} c |\mathbf{h}(\mathbf{x})|^2 \tag{39}
$$

for some large $c$. Calling the penalty objective function $q(\mathbf{x})$ we consider the
iterative process

$$
\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \left[\mathbf{I} + c \nabla \mathbf{h}(\mathbf{x}_k)^T \nabla \mathbf{h}(\mathbf{x}_k)\right]^{-1} \nabla q(\mathbf{x}_k)^T, \tag{40}
$$

where $\alpha_k$ is chosen to minimize $q(\mathbf{x}_{k+1})$. The matrix $\mathbf{I} + c \nabla \mathbf{h}(\mathbf{x}_k)^T \nabla \mathbf{h}(\mathbf{x}_k)$ is
positive definite and although quite ill-conditioned it can be inverted efficiently
(see Exercise 11).

According to the Modified Newton Method Theorem (Section 10.1) the rate of
convergence of this method is determined by the eigenvalues of the matrix

$$
\left[\mathbf{I} + c \nabla \mathbf{h}(\mathbf{x}_k)^T \nabla \mathbf{h}(\mathbf{x}_k)\right]^{-1} \mathbf{Q}(\mathbf{x}_k), \tag{41}
$$

where $\mathbf{Q}(\mathbf{x}_k)$ is the Hessian of $q$ at $\mathbf{x}_k$. In view of (26), as $c \to \infty$ the matrix (41)
will have $m$ eigenvalues that approach unity, while the remaining $n - m$ eigenvalues
approach the eigenvalues of $\mathbf{L}_M$ evaluated at the solution $\mathbf{x}^*$ of (38). Thus, if the
smallest and largest eigenvalues of $\mathbf{L}_M$, $a$ and $A$, are located such that the interval
$[a, A]$ contains unity, the convergence ratio of this modified Newton’s method will
be equal (in the limit of $c \to \infty$) to the canonical ratio $[(A - a)/(A + a)]^2$ for
problem (38).

If the eigenvalues of $\mathbf{L}_M$ are not spread below and above unity, the convergence
rate will be slowed. If a point in the interval containing the eigenvalues of $\mathbf{L}_M$ is
known, a scalar factor can be introduced so that the canonical rate is achieved, but
such information is often not easily available.
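A minimal sketch of iteration (40) follows, assuming a hypothetical two-variable problem: $f(\mathbf{x}) = \frac{1}{2}(x_1^2 + 3x_2^2)$ with the single constraint $h(\mathbf{x}) = x_1 + x_2 - 1 = 0$. With one constraint the matrix $\mathbf{I} + c\,\mathbf{a}\mathbf{a}^T$, $\mathbf{a} = (1, 1)$, can be inverted by the Sherman-Morrison formula, which is one way (assumed here) of carrying out the efficient inversion mentioned above. Because $q$ is quadratic in this example, the exact line search has a closed form. All function names are illustrative.

```python
def grad_q(x, c):
    """Gradient of q(x) = 0.5*(x1^2 + 3*x2^2) + 0.5*c*(x1 + x2 - 1)^2."""
    r = x[0] + x[1] - 1.0                      # constraint residual h(x)
    return [x[0] + c * r, 3.0 * x[1] + c * r]

def hess_q_times(d, c):
    """Product Q d, where Q = diag(1, 3) + c * a a^T is the Hessian of q."""
    s = d[0] + d[1]
    return [d[0] + c * s, 3.0 * d[1] + c * s]

def apply_inverse(v, c):
    """(I + c a a^T)^{-1} v for a = (1, 1), by the Sherman-Morrison formula."""
    s = c * (v[0] + v[1]) / (1.0 + 2.0 * c)
    return [v[0] - s, v[1] - s]

def modified_newton(c, tol=1e-6, max_iter=200):
    """Iteration (40) with exact line search; returns (x, iterations used)."""
    x = [0.0, 0.0]
    for k in range(max_iter):
        g = grad_q(x, c)
        if (g[0] ** 2 + g[1] ** 2) ** 0.5 <= tol:
            return x, k
        d = apply_inverse(g, c)
        Qd = hess_q_times(d, c)
        # exact minimizer of q(x - alpha * d) for a quadratic q
        alpha = (g[0] * d[0] + g[1] * d[1]) / (d[0] * Qd[0] + d[1] * Qd[1])
        x = [x[0] - alpha * d[0], x[1] - alpha * d[1]]
    return x, max_iter

if __name__ == "__main__":
    for c in [1e2, 1e4, 1e6]:
        x, iters = modified_newton(c)
        print(f"c={c:9.1e}  x=({x[0]:.6f}, {x[1]:.6f})  iterations={iters}")
```

For this example the eigenvalues of the matrix (41) are close to 1 and 2 for large $c$, so the number of iterations should stay roughly constant as $c$ grows, whereas unscaled steepest descent would see a condition number proportional to $c$.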


Inequalities 


If there are inequality as well as equality constraints in the problem, the analogous 
procedure can be applied to the associated penalty objective function. The unusual
feature of this case is that corresponding to an inequality constraint $g_i(\mathbf{x}) \leq 0$,
the term $\nabla g_i^+(\mathbf{x})^T \nabla g_i^+(\mathbf{x})$ used in the iteration matrix will suddenly appear if the
constraint is violated. Thus the iteration matrix is discontinuous with respect to $\mathbf{x}$,
and as the method progresses its nature changes according to which constraints 
are violated. This discontinuity does not, however, imply that the method is 
subject to jamming, since the result of Exercise 4, Chapter 10 is applicable to this 
method. 


13.5 CONJUGATE GRADIENTS AND PENALTY 
METHODS 


The partial conjugate gradient method proposed and analyzed in Section 9.5 is 
ideally suited to penalty or barrier problems having only a few active constraints. If 
there are $m$ active constraints, then taking cycles of $m + 1$ conjugate gradient steps
will yield a rate of convergence that is independent of the penalty constant $c$. For
example, consider the problem having only equality constraints:

$$
\begin{aligned}
&\text{minimize} && f(\mathbf{x}) \\
&\text{subject to} && \mathbf{h}(\mathbf{x}) = \mathbf{0},
\end{aligned} \tag{42}
$$

where $\mathbf{x} \in E^n$, $\mathbf{h}(\mathbf{x}) \in E^m$, $m < n$. Applying the standard quadratic penalty method,
we solve instead the unconstrained problem

$$
\text{minimize} \quad f(\mathbf{x}) + \tfrac{1}{2} c |\mathbf{h}(\mathbf{x})|^2 \tag{43}
$$

for some large $c$. The objective function of this problem has a Hessian matrix
that has $m$ eigenvalues that are of the order $c$ in magnitude, while the remaining
$n - m$ eigenvalues are close to the eigenvalues of the matrix $\mathbf{L}_M$, corresponding to
problem (42). Thus, letting $\mathbf{x}_{k+1}$ be determined from $\mathbf{x}_k$ by taking $m + 1$ steps of a
(nonquadratic) conjugate gradient method, and assuming $\mathbf{x}_k \to \bar{\mathbf{x}}$, a solution to (43),
the sequence $\{f(\mathbf{x}_k)\}$ converges linearly to $f(\bar{\mathbf{x}})$ with a convergence ratio equal to
approximately

$$
\left(\frac{A - a}{A + a}\right)^2, \tag{44}
$$

where $a$ and $A$ are, respectively, the smallest and largest eigenvalues of $\mathbf{L}_M(\bar{\mathbf{x}})$.

This is an extremely effective technique when $m$ is relatively small. The
programming logic required is only slightly greater than that of steepest descent,
and the time per iteration is only about $m + 1$ times as great as for steepest descent.
The method can be used for problems having inequality constraints as well but it
is advisable to change the cycle length, depending on the number of constraints
active at the end of the previous cycle.

Example 3.


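$$
\begin{aligned}
&\text{minimize} && f(x_1, x_2, \ldots, x_{10}) = \sum_{k=1}^{10} k x_k^2 \\
&\text{subject to} && 1.5x_1 + x_2 + x_3 + 0.5x_4 + 0.5x_5 = 5.5 \\
& && 2.0x_6 - 0.5x_7 - 0.5x_8 + x_9 - x_{10} = 2.0 \\
& && x_1 + x_3 + x_5 + x_7 + x_9 = 10 \\
& && x_2 + x_4 + x_6 + x_8 + x_{10} = 15.
\end{aligned}
$$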


This problem was treated by the penalty function approach, and the resulting 
composite function was then solved for various values of c by using various cycle 
lengths of a conjugate gradient algorithm. In Table 13.1 p is the number of conjugate 
gradient steps in a cycle. Thus, p = 1 corresponds to ordinary steepest descent; 
p=5 corresponds, by the theory of Section 9.5, to the smallest value of p for which 
the rate of convergence is independent of c; and p = 10 is the standard conjugate 
gradient method. Note that for $p < 5$ the convergence rate does indeed depend on
$c$, while it is more or less constant for $p \geq 5$. The values of $c$ selected are not
artificially large, since for $c = 200$ the constraints are satisfied only to within 0.5
percent of their right-hand sides. For problems with nonlinear constraints the results 
will most likely be somewhat less favorable, since the predicted convergence rate 
would apply only to the tail of the sequence. 
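The cycle structure discussed above can be sketched as follows. To keep the code short this uses a smaller hypothetical problem rather than the example of the text: minimize $\sum_{k=1}^{6} k x_k^2$ subject to $x_1 + x_3 + x_5 = 3$ and $x_2 + x_4 + x_6 = 3$, so $m = 2$ and a cycle of $m + 1 = 3$ steps is the shortest one whose rate should not depend on $c$. The starting point, tolerance, and function names are assumptions, and the exact line search relies on the penalty objective being quadratic here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def penalty_grad(x, c):
    """Gradient of sum_k k*x_k^2 + 0.5*c*(r1^2 + r2^2) for the toy problem."""
    r1 = x[0] + x[2] + x[4] - 3.0
    r2 = x[1] + x[3] + x[5] - 3.0
    return [2.0 * (k + 1) * x[k] + c * (r1 if k % 2 == 0 else r2)
            for k in range(6)]

def hess_vec(d, c):
    """Hessian-vector product for the same quadratic penalty objective."""
    s1 = d[0] + d[2] + d[4]
    s2 = d[1] + d[3] + d[5]
    return [2.0 * (k + 1) * d[k] + c * (s1 if k % 2 == 0 else s2)
            for k in range(6)]

def partial_cg(c, cycle, tol=1e-6, max_steps=5000):
    """Conjugate gradients restarted every `cycle` steps; returns (x, steps)."""
    x = [0.0] * 6
    g = penalty_grad(x, c)
    steps = 0
    while dot(g, g) ** 0.5 > tol and steps < max_steps:
        d = [-gi for gi in g]                  # restart with steepest descent
        for _ in range(cycle):
            Qd = hess_vec(d, c)
            alpha = -dot(g, d) / dot(d, Qd)    # exact line search (quadratic)
            x = [xi + alpha * di for xi, di in zip(x, d)]
            g_new = penalty_grad(x, c)
            steps += 1
            if dot(g_new, g_new) ** 0.5 <= tol:
                g = g_new
                break
            beta = dot(g_new, g_new) / dot(g, g)   # Fletcher-Reeves formula
            d = [-gn + beta * di for gn, di in zip(g_new, d)]
            g = g_new
    return x, steps

if __name__ == "__main__":
    for c in [1e2, 1e4]:
        for cycle in [1, 3, 6]:
            x, steps = partial_cg(c, cycle)
            print(f"c={c:8.1e}  steps per cycle={cycle}  total steps={steps}")
```

Setting `cycle=1` gives ordinary steepest descent and `cycle=6` the full conjugate gradient method for this six-variable problem, so the script can be used to compare step counts across cycle lengths in the spirit of the table that accompanies the example.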


Table 13.1

              p (steps      Number of cycles    No. of    Value of modified
              per cycle)    to convergence      steps     objective
c = 20        1             90                  90        388.565
              3             8                   24        388.563
              5             3                   15        388.563
              7             3                   21        388.563
c = 200       1             230*                230       488.607
              3             21                  63        487.446
              5             4                   20        487.438
              7             2                   14        487.433
c = 2000      1             260*                260       525.238
              3             45*                 135       503.550
              5             3                   15        500.910
              7             3                   21        500.882

* Program not run to convergence due to excessive time.


13.6 NORMALIZATION OF PENALTY FUNCTIONS


There is a good deal of freedom in the selection of penalty or barrier functions that 
can be exploited to accelerate convergence. We propose here a simple normalization 
procedure that together with a two-step cycle of conjugate gradients yields the 
canonical rate of convergence. Again for simplicity we illustrate the technique for 
the penalty method applied to the problem 


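$$
\begin{aligned}
&\text{minimize} && f(\mathbf{x}) \\
&\text{subject to} && \mathbf{h}(\mathbf{x}) = \mathbf{0}
\end{aligned} \tag{45}
$$

as in Sections 13.4 and 13.5, but the idea is easily extended to other penalty or
barrier situations.
Corresponding to (45) we consider the family of quadratic penalty functions

$$
P(\mathbf{x}) = \tfrac{1}{2} \mathbf{h}(\mathbf{x})^T \boldsymbol{\Gamma} \mathbf{h}(\mathbf{x}), \tag{46}
$$

where $\boldsymbol{\Gamma}$ is a symmetric positive definite $m \times m$ matrix. We ask what the best choice
of $\boldsymbol{\Gamma}$ might be.
Letting

$$
q(c, \mathbf{x}) = f(\mathbf{x}) + cP(\mathbf{x}), \tag{47}
$$

the Hessian of $q$ turns out to be, using (26),

$$
\mathbf{Q}(c, \mathbf{x}_k) = \mathbf{L}(\mathbf{x}_k) + c \nabla \mathbf{h}(\mathbf{x}_k)^T \boldsymbol{\Gamma} \nabla \mathbf{h}(\mathbf{x}_k). \tag{48}
$$

The $m$ large eigenvalues are due to the second term on the right. The observation
we make is that although the $m$ large eigenvalues are all proportional to $c$, they are
not necessarily all equal. Indeed, for very large $c$ these eigenvalues are determined
almost exclusively by the second term, and are therefore $c$ times the nonzero
eigenvalues of the matrix $\nabla \mathbf{h}(\mathbf{x}_k)^T \boldsymbol{\Gamma} \nabla \mathbf{h}(\mathbf{x}_k)$. We would like to select $\boldsymbol{\Gamma}$ so that these
eigenvalues are not spread out but are nearly equal to one another. An ideal choice
for the $k$th iteration would be

$$
\boldsymbol{\Gamma} = \left[\nabla \mathbf{h}(\mathbf{x}_k) \nabla \mathbf{h}(\mathbf{x}_k)^T\right]^{-1}, \tag{49}
$$

since then all nonzero eigenvalues would be exactly equal. However, we do not
allow $\boldsymbol{\Gamma}$ to change at each step, and therefore compromise by setting

$$
\boldsymbol{\Gamma} = \left[\nabla \mathbf{h}(\mathbf{x}_0) \nabla \mathbf{h}(\mathbf{x}_0)^T\right]^{-1}, \tag{50}
$$

where $\mathbf{x}_0$ is the initial point of the iteration.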

Using this penalty function, the corresponding eigenvalue structure will at any
point look approximately like that shown in Fig. 13.5. The eigenvalues are bunched
into two separate groups. As c is increased the smaller eigenvalues move into the
interval [a, A] where a and A are, as usual, the smallest and largest eigenvalues of
$L_M$ at the solution to (45). The larger eigenvalues move forward to the right and
spread further apart.

[Figure: two groups of eigenvalues on a horizontal axis marked 0, a, A, c]

Fig. 13.5 Eigenvalue distributions


Table 13.2

             p (steps      Number of cycles    No. of    Value of modified
             per cycle)    to convergence      steps     objective
c = 10       1             28                  28        251.2657
             2             9                   18        251.2657
             3             5                   15        251.2657
c = 100      1             153                 153       379.5955
             2             13                  26        379.5955
             3             11                  33        379.5955
c = 1000     1             261*                261       402.0903
             2             14                  28        400.1687
             3             13                  39        400.1687

* Program not run to convergence due to excessive time.

Using the result of Exercise 11, Chapter 9, we see that if $x_{k+1}$ is determined
from $x_k$ by two conjugate gradient steps, the rate of convergence will be linear at a
ratio determined by the wider of the two eigenvalue groups. If our normalization is
sufficiently accurate, the large-valued group will have the lesser width. In that case 
convergence of this scheme is approximately that of the canonical rate for the original 
problem. Thus, by proper normalization it is possible to obtain the canonical rate of 
convergence for only about twice the time per iteration as required by steepest descent. 

There are, of course, numerous variations of this method that can be used 
in practice. $\Gamma$ can, for example, be allowed to vary at each step, or it can be
occasionally updated. 
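
The effect of the normalization can be checked on a small case. The sketch below is only an illustration: the 2 x 3 matrix `A` standing in for $\nabla h(x_0)$ is made-up data, and the helper functions are hypothetical names written for this example. It compares the nonzero eigenvalues of $\nabla h^T \nabla h$ (the choice $\Gamma = I$) with the matrix $\nabla h^T \Gamma \nabla h$ obtained from (50), which turns out to be an orthogonal projection, so that all of its nonzero eigenvalues equal 1.

```python
import math

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def inverse_2x2(S):
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def eigenvalues_sym_2x2(S):
    """Eigenvalues (small, large) of a symmetric 2x2 matrix."""
    (a, b), (_, d) = S
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius

# Hypothetical constraint Jacobian grad h(x_0) with m = 2, n = 3.
A = [[1.0, 0.0, 0.0],
     [1.0, 2.0, 0.0]]
G = matmul(A, transpose(A))            # grad h grad h^T  (m x m)

# Gamma = I: the nonzero eigenvalues of A^T A coincide with the
# eigenvalues of A A^T, and here they are well separated.
small, large = eigenvalues_sym_2x2(G)

# Normalization (50): Gamma = [grad h grad h^T]^{-1}.
Gamma = inverse_2x2(G)
P = matmul(transpose(A), matmul(Gamma, A))     # A^T Gamma A  (n x n)
P_squared = matmul(P, P)
trace_P = sum(P[i][i] for i in range(3))

print("unnormalized nonzero eigenvalues:", small, large)
print("A^T Gamma A is a projection of rank", round(trace_P))
```

Because `P` is symmetric and satisfies `P_squared == P`, its eigenvalues are 0 or 1, and its trace counts the eigenvalues equal to 1. In an actual run $\Gamma$ is frozen at $x_0$, so at later points the large eigenvalues are only approximately equal.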


Example. The example problem presented in the previous section was also solved 
by the normalization method presented above. The results for various values of c 
and for cycle lengths of one, two, and three are presented in Table 13.2. (All runs 
were initiated from the zero vector.) 


13.7 PENALTY FUNCTIONS AND GRADIENT PROJECTION


The penalty function method can be combined with the idea of the gradient 
projection method to yield an attractive general purpose procedure for solving 
constrained optimization problems. The proposed combination method can be 
viewed either as a way of accelerating the rate of convergence of the penalty
function method by eliminating the effect of the large eigenvalues, or as a technique 
for efficiently handling the delicate and usually cumbersome requirement in the 
gradient projection method that each point be feasible. The combined method 
converges at the canonical rate (the same as does the gradient projection method), 
is globally convergent (unlike the gradient projection method), and avoids much of 
the computational difficulty associated with staying feasible. 


Underlying Concept 


The basic theoretical result that motivates the development of this algorithm is the 
Combined Steepest Descent and Newton’s Method Theorem of Section 10.7. The 
idea is to apply this combined method to a penalty problem. For simplicity we first 
consider the equality constrained problem 


minimize    f(x)
subject to  h(x) = 0,                                       (51)


where $x \in E^n$, $h(x) \in E^m$. The associated unconstrained penalty problem that we
consider is

minimize    q(x),                                           (52)

where

$$q(x) = f(x) + \tfrac{1}{2} c\, |h(x)|^2.$$


At any point $x_k$ let $M(x_k)$ be the subspace tangent to the surface $S_k = \{x :
h(x) = h(x_k)\}$. This is a slight extension of the tangent subspaces that we have
considered before, since $M(x_k)$ is defined even for points that are not feasible. If
the sequence $\{x_k\}$ converges to a solution $x_c$ of problem (52), then we expect that
$M(x_k)$ will in some sense converge to $M(x_c)$. The orthogonal complement of $M(x_k)$
is the space generated by the gradients of the constraint functions evaluated at $x_k$.
Let us denote this space by $N(x_k)$. The idea of the algorithm is to take N as the
subspace over which Newton's method is applied, and M as the space over which
the gradient method is applied. A cycle of the algorithm would be as follows:


1. Given $x_k$, apply one step of Newton's method over the subspace $N(x_k)$ to obtain
a point $w_k$ of the form

$$w_k = x_k + \nabla h(x_k)^T u_k, \qquad u_k \in E^m.$$

2. From $w_k$, take an ordinary steepest descent step to obtain $x_{k+1}$.


Of course, we must show how Step 1 can be easily executed, and this is done below,
but first, without drawing out the details, let us examine the general structure of 
this algorithm. 

The process is illustrated in Fig. 13.6. The first step is analogous to the step
in the gradient projection method that returns to the feasible surface, except that
here the criterion is reduction of the objective function rather than satisfaction
of constraints. To interpret the second step, suppose for the moment that the
original problem (51) has a quadratic objective and linear constraints, so that,
consequently, the penalty problem (52) has a quadratic objective and N(x), M(x)
and $\nabla h(x)$ are independent of x. In that case the first (Newton) step would
exactly minimize q with respect to N, so that the gradient of q at $w_k$ would be
orthogonal to N; that is, the gradient would lie in the subspace M. Furthermore,
since $\nabla q(w_k) = \nabla f(w_k) + c\,h(w_k)^T \nabla h(w_k)$, we see that $\nabla q(w_k)$ would in that
case be equal to the projection of the gradient of f onto M. Hence, the second
step is, in the quadratic case exactly, and in the general case approximately, a
move in the direction of the projected negative gradient of the original objective
function.

The convergence properties of such a scheme are easily predicted from the
theorem on the Combined Steepest Descent and Newton's Method, in Section 10.7,
and our analysis of the structure of the Hessian of the penalty objective function
given by (26). As $x_k \to x_c$ the rate will be determined by the ratio of largest to
smallest eigenvalues of the Hessian restricted to $M(x_c)$.

This leads, however, by what was shown in Section 12.3, to approximately the
canonical rate for problem (51). Thus this combined method will yield again the
canonical rate as $c \to \infty$.


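[Figure: one cycle of the method, showing $x_k$, $w_k$, $x_{k+1}$, the direction $\nabla h(x_k)^T$,
the translated subspaces $M(x_k) + x_k$ and $M(x_k) + w_k$, and the surface h(x) = 0]

Fig. 13.6 Illustration of the method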
Implementing the First Step


To implement the first step of the algorithm suggested above it is necessary to show
how a Newton step can be taken in the subspace $N(x_k)$. We show that, again for
large values of c, this can be accomplished easily.

At the point $x_k$ the function b, defined by

$$b(u) = q(x_k + \nabla h(x_k)^T u) \tag{53}$$

for $u \in E^m$, measures the variations in q with respect to displacements in $N(x_k)$.
We shall, for simplicity, assume that at each point $x_k$, $\nabla h(x_k)$ has rank m. We can
immediately calculate the gradient with respect to u,

$$\nabla b(u) = \nabla q(x_k + \nabla h(x_k)^T u)\, \nabla h(x_k)^T, \tag{54}$$

and the m x m Hessian with respect to u at u = 0,

$$B = \nabla h(x_k)\, Q(x_k)\, \nabla h(x_k)^T, \tag{55}$$

where Q is the n x n Hessian of q with respect to x. From (26) we have that at $x_k$

$$Q(x_k) = L(x_k) + c\nabla h(x_k)^T \nabla h(x_k). \tag{56}$$

And given B, the direction for the Newton step in N would be

$$\begin{aligned}
d_k &= -\nabla h(x_k)^T B^{-1} \nabla b(0)^T \\
    &= -\nabla h(x_k)^T B^{-1} \nabla h(x_k) \nabla q(x_k)^T.
\end{aligned} \tag{57}$$


It is clear from (55) and (56) that exact evaluation of the Newton step requires
knowledge of $L(x_k)$ which usually is costly to obtain. For large values of c, however,
B can be approximated by

$$B \simeq c[\nabla h(x_k) \nabla h(x_k)^T]^2, \tag{58}$$

and hence a good approximation to the Newton direction is

$$d_k = -\frac{1}{c} \nabla h(x_k)^T [\nabla h(x_k) \nabla h(x_k)^T]^{-2} \nabla h(x_k) \nabla q(x_k)^T. \tag{59}$$

Thus a suitable implementation of one cycle of the algorithm is:

1. Calculate

$$d_k = -\frac{1}{c} \nabla h(x_k)^T [\nabla h(x_k) \nabla h(x_k)^T]^{-2} \nabla h(x_k) \nabla q(x_k)^T.$$

2. Find $\beta_k$ to minimize $q(x_k + \beta d_k)$ (using $\beta = 1$ as an initial search point), and
set $w_k = x_k + \beta_k d_k$.

3. Calculate $p_k = -\nabla q(w_k)^T$.

4. Find $\alpha_k$ to minimize $q(w_k + \alpha p_k)$, and set $x_{k+1} = w_k + \alpha_k p_k$.


It is interesting to compare the Newton step of this version of the algorithm 
with the step for returning to the feasible region used in the ordinary gradient 
projection method. We have 


$$\nabla q(x_k)^T = \nabla f(x_k)^T + c\nabla h(x_k)^T h(x_k). \tag{60}$$


If we neglect $\nabla f(x_k)^T$ on the right (as would be valid if we are a long distance
from the constraint boundary) then the vector $d_k$ reduces to

$$d_k = -\nabla h(x_k)^T [\nabla h(x_k) \nabla h(x_k)^T]^{-1} h(x_k),$$

which is precisely the first estimate used to return to the boundary in the gradient
projection method. The scheme developed in this section can therefore be regarded
as one which corrects this estimate by accounting for the variation in f.

An important advantage of the present method is that it is not necessary to carry 
out the search in detail. If $\beta = 1$ yields an improved value for the penalty objective,
no further search is required. If not, one need search only until some improvement
is obtained. At worst, if this search is poorly performed, the method degenerates
to steepest descent. When one finally gets close to the solution, however, $\beta = 1$ is
bound to yield an improvement and terminal convergence will progress at nearly 
the canonical rate. 
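
The four-step cycle can be sketched on a tiny instance. Everything specific below is an assumption made for illustration: the objective $f(x) = x_1^2 + 2x_2^2$, the single linear constraint $h(x) = x_1 + x_2 - 1$, the value c = 1000, and the function names. Because this q is quadratic, the two line searches are done in closed form; for a general q a numerical line search would take their place.

```python
C = 1000.0            # penalty parameter c (illustrative value)
A = (1.0, 1.0)        # gradient of h(x) = x1 + x2 - 1 (constant for this h)

def h(x):
    return x[0] + x[1] - 1.0

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def grad_q(x):
    # grad q = grad f + c h(x) grad h, with f(x) = x1^2 + 2 x2^2
    return (2.0 * x[0] + C * h(x) * A[0], 4.0 * x[1] + C * h(x) * A[1])

def hessian_times(v):
    # Q v, where Q = diag(2, 4) + c A^T A is the constant Hessian of q
    av = dot(A, v)
    return (2.0 * v[0] + C * av * A[0], 4.0 * v[1] + C * av * A[1])

def exact_step(x, d):
    # Step length minimizing the quadratic q along x + t d (0 if d = 0)
    curvature = dot(d, hessian_times(d))
    return 0.0 if curvature == 0.0 else -dot(grad_q(x), d) / curvature

def cycle(x):
    g = grad_q(x)
    # Step 1: approximate Newton direction (59), specialized to m = 1
    scale = -dot(A, g) / (C * dot(A, A) ** 2)
    d = (scale * A[0], scale * A[1])
    # Step 2: line search along d gives w_k
    beta = exact_step(x, d)
    w = (x[0] + beta * d[0], x[1] + beta * d[1])
    # Step 3: steepest descent direction at w_k
    gw = grad_q(w)
    p = (-gw[0], -gw[1])
    # Step 4: line search along p gives x_{k+1}
    alpha = exact_step(w, p)
    return (w[0] + alpha * p[0], w[1] + alpha * p[1])

x = (0.0, 0.0)
for _ in range(10):
    x = cycle(x)

# Minimizer of q obtained by solving grad q = 0 by hand for this instance
x_penalty = (2.0 * C / (4.0 + 3.0 * C), C / (4.0 + 3.0 * C))
print(x, x_penalty)
```

For this instance the iterates settle on the minimizer of the penalty problem, which lies close to the constrained solution (2/3, 1/3) and approaches it as c grows.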


Inequality Constraints 


The procedure is conceptually the same for problems with inequality constraints. 
The only difference is that at the beginning of each cycle the subspace M(x,) is 
calculated on the basis of those constraints that are either active or violated at x,, 
the others being ignored. The resulting technique is a descent algorithm in that the 
penalty objective function decreases at each cycle; it is globally convergent because 
of the pure gradient step taken at the end of each cycle; its rate of convergence 
approaches the canonical rate for the original constrained problem as $c \to \infty$; and
there are no feasibility tolerances or subroutine iterations required. 


13.8 EXACT PENALTY FUNCTIONS 


It is possible to construct penalty functions that are exact in the sense that the 
solution of the penalty problem yields the exact solution to the original problem 
for a finite value of the penalty parameter. With these functions it is not necessary 
to solve an infinite sequence of penalty problems to obtain the correct solution. 
However, a new difficulty introduced by these penalty functions is that they are
nondifferentiable.
For the general constrained problem

minimize    f(x)
subject to  h(x) = 0                                        (61)
            g(x) ≤ 0,

consider the absolute-value penalty function

$$P(x) = \sum_{i=1}^{m} |h_i(x)| + \sum_{j=1}^{p} \max(0, g_j(x)). \tag{62}$$

The penalty problem is then, as usual,

minimize    f(x) + cP(x)                                    (63)

for some positive constant c. We investigate the properties of the absolute-value
penalty function through an example and then generalize the results.


Example 1. Consider the simple quadratic problem 


minimize    $2x^2 + 2xy + y^2 - 2y$
subject to  $x = 0$.                                        (64)


It is easy to solve this problem directly by substituting x = 0 into the objective.
This leads immediately to x = 0, y = 1.
If a standard quadratic penalty function is used, we minimize the objective

$$2x^2 + 2xy + y^2 - 2y + \tfrac{1}{2} c x^2 \tag{65}$$

for c > 0. The solution again can be easily found and is $x = -2/(2+c)$, $y =
1 + 2/(2+c)$ (stationarity with respect to y gives y = 1 - x). This solution approaches
the true solution as $c \to \infty$, as predicted by the general theory. However, for any
finite c the solution is inexact.

Now let us use the absolute-value penalty function. We minimize the function

$$2x^2 + 2xy + y^2 - 2y + c|x|. \tag{66}$$

We rewrite (66) as

$$\begin{aligned}
2x^2 + 2xy + y^2 - 2y + c|x|
 &= 2x^2 + 2xy + c|x| + (y-1)^2 - 1 \\
 &= 2x^2 + 2x + c|x| + (y-1)^2 + 2x(y-1) - 1 \\
 &= x^2 + (2x + c|x|) + (y - 1 + x)^2 - 1.
\end{aligned} \tag{67}$$
All terms (except the -1) are nonnegative if c > 2. Therefore, the minimum value
of this expression is —1, which is achieved (uniquely) by x = 0, y = 1. Therefore, 
for c > 2 the minimum point of the penalty problem is the correct solution to the 
original problem (64). 
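
The threshold behavior in this example can be observed numerically. The following is a rough sketch, not part of the original development: it minimizes the penalized objectives by brute force over a grid, and the function names, the grid spacing of 0.01, and the sample values c = 1 and c = 3 are all choices made here for illustration.

```python
def penalized(x, y, c, kind):
    """Objective of (64) plus a penalty on the constraint x = 0."""
    f = 2 * x * x + 2 * x * y + y * y - 2 * y
    if kind == "absolute":
        return f + c * abs(x)          # absolute-value penalty, as in (66)
    return f + 0.5 * c * x * x         # quadratic penalty, as in (65)

def grid_minimizer(c, kind, n=100):
    """Brute-force minimizer on a grid of spacing 0.01 centered at (0, 1)."""
    best_value, best_point = None, None
    for i in range(-n, n + 1):
        x = i / 100.0
        for j in range(-n, n + 1):
            y = 1.0 + j / 100.0
            value = penalized(x, y, c, kind)
            if best_value is None or value < best_value:
                best_value, best_point = value, (x, y)
    return best_point

abs_small_c = grid_minimizer(1.0, "absolute")    # c below 2
abs_large_c = grid_minimizer(3.0, "absolute")    # c above 2
quad_large_c = grid_minimizer(3.0, "quadratic")  # same c, quadratic penalty
print(abs_small_c, abs_large_c, quad_large_c)
```

With c = 3 the absolute-value penalty returns the constrained solution (0, 1) itself, whereas with c = 1 its minimizer moves off the constraint, and the quadratic penalty with c = 3 still yields a point with x different from 0.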

We let the reader verify that $\lambda = -2$ for this example. The fact that $c > |\lambda|$ is
required for the solution to be exact is an illustration of a general result given by
the following theorem. 


Exact Penalty Theorem. Suppose that the point x* satisfies the second-order
sufficiency conditions for a local minimum of the constrained problem (61). Let
$\lambda$ and $\mu$ be the corresponding Lagrange multipliers. Then for $c > \max\{|\lambda_i|, \mu_j :
i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, p\}$, x* is also a local minimum of the absolute-
value penalty objective (62).


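Proof. For simplicity we assume that there are equality constraints only. Define
the primal function

$$\omega(z) = \min_x \{ f(x) : h_i(x) = z_i \ \text{for } i = 1, 2, \ldots, m \}. \tag{68}$$

The primal function was introduced in Section 12.3. Under our assumption the
function exists in a neighborhood of x* and is continuously differentiable, with
$\nabla \omega(0) = -\lambda^T$.

Now define

$$\omega_c(z) = \omega(z) + c \sum_{i=1}^{m} |z_i|.$$

Then we have

$$\begin{aligned}
\min_x \Big\{ f(x) + c \sum_{i=1}^{m} |h_i(x)| \Big\}
 &= \min_{z, x} \Big\{ f(x) + c \sum_{i=1}^{m} |z_i| : h(x) = z \Big\} \\
 &= \min_z \Big\{ \omega(z) + c \sum_{i=1}^{m} |z_i| \Big\} \\
 &= \min_z \omega_c(z).
\end{aligned}$$

By the Mean Value Theorem,

$$\omega(z) = \omega(0) + \nabla \omega(\alpha z) z$$

for some $\alpha$, $0 \le \alpha \le 1$. Therefore,

$$\omega_c(z) = \omega(0) + \nabla \omega(\alpha z) z + c \sum_{i=1}^{m} |z_i|. \tag{69}$$

We know that $\nabla \omega(z)$ is continuous at 0, and thus given $\varepsilon > 0$ there is a neighborhood
of 0 such that $|\nabla \omega(z)_i| < |\lambda_i| + \varepsilon$. Thus

$$\begin{aligned}
\nabla \omega(\alpha z) z = \sum_{i=1}^{m} \nabla \omega(\alpha z)_i z_i
 &\ge -\Big\{ \max_i |\nabla \omega(\alpha z)_i| \Big\} \sum_{i=1}^{m} |z_i| \\
 &\ge -\Big\{ \max_i |\lambda_i| + \varepsilon \Big\} \sum_{i=1}^{m} |z_i|.
\end{aligned}$$

Using this in (69), we obtain

$$\omega_c(z) \ge \omega(0) + \Big( c - \varepsilon - \max_i |\lambda_i| \Big) \sum_{i=1}^{m} |z_i|.$$

For $c > \varepsilon + \max_i |\lambda_i|$ it follows that $\omega_c(z)$ is minimized at z = 0. Since $\varepsilon$ was
arbitrary, the result holds for $c > \max_i |\lambda_i|$.

This result is easily extended to include inequality constraints. (See
Exercise 16.) ∎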


It is possible to develop a geometric interpretation of the absolute-value penalty 
function analogous to the interpretation for ordinary penalty functions given in 
Fig. 13.4. Figure 13.7 corresponds to a problem for a single constraint. The smooth 
curve represents the primal function of the problem. Its value at 0 is the value of 
the original problem, and its slope at 0 is $-\lambda$. The function $\omega_c(z)$ is obtained by
adding c|z| to the primal function, and this function has a discontinuous derivative
at z = 0. It is clear that for $c > |\lambda|$, this composite function has a minimum at
exactly z = 0, corresponding to the correct solution. 
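
As a concrete instance of this picture, the primal function of problem (64) can be written in closed form. The derivation below is a sketch obtained here by direct substitution of the perturbed constraint x = z; it is not taken from the original text.

```latex
\begin{aligned}
\omega(z) &= \min_{y}\,\{\, 2z^2 + 2zy + y^2 - 2y \,\}
           = z^2 + 2z - 1 \qquad (\text{attained at } y = 1 - z), \\
\omega'(0) &= 2 = -\lambda, \qquad \text{so } \lambda = -2, \\
\omega_c(z) &= z^2 + 2z + c\,|z| - 1 .
\end{aligned}
```

Just to the left of z = 0 the slope of $\omega_c$ is 2 - c, and just to the right it is 2 + c. The kink at z = 0 is therefore the minimum once c exceeds $|\lambda| = 2$, which agrees with the conclusion reached from (67).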


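[Figure: the primal function $\omega(z)$ and the composite function $\omega(z) + c|z|$ plotted
against z, with the kink of the composite function at z = 0]

Fig. 13.7 Geometric interpretation of absolute-value penalty function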
There are other exact penalty functions but, like the absolute-value penalty
function, most are nondifferentiable at the solution. Such penalty functions are for 
this reason difficult to use directly; special descent algorithms for nondifferentiable 
objective functions have been developed, but they can be cumbersome. Furthermore, 
although these penalty functions are exact for a large enough c, it is not known at 
the outset what magnitude is sufficient. In practice a progression of c’s must often 
be used. Because of these difficulties, the major use of exact penalty functions in 
nonlinear programming is as merit functions—measuring the progress of descent 
but not entering into the determination of the direction of movement. This idea is 
discussed in Chapter 15. 


13.9 SUMMARY 


Penalty methods approximate a constrained problem by an unconstrained problem 
that assigns high cost to points that are far from the feasible region. As the 
approximation is made more exact (by letting the parameter c tend to infinity) the 
solution of the unconstrained penalty problem approaches the solution to the original 
constrained problem from outside the active constraints. Barrier methods, on the 
other hand, approximate a constrained problem by an (essentially) unconstrained 
problem that assigns high cost to being near the boundary of the feasible region, 
but unlike penalty methods, these methods are applicable only to problems having a 
robust feasible region. As the approximation is made more exact, the solution of the 
unconstrained barrier problem approaches the solution to the original constrained 
problem from inside the feasible region. 

The objective functions of all penalty and barrier methods of the form $P(x) =
\gamma(h(x))$, $B(x) = \eta(g(x))$ are ill-conditioned. If they are differentiable, then as $c \to
\infty$ the Hessian (at the solution) is equal to the sum of L, the Hessian of the
Lagrangian associated with the original constrained problem, and a matrix of rank 
r that tends to infinity (where r is the number of active constraints). This is a 
fundamental property of these methods. 

Effective exploitation of differentiable penalty and barrier functions requires 
that schemes be devised that eliminate the effect of the associated large eigen- 
values. For this purpose the three general principles developed in earlier chapters, 
The Partial Conjugate Gradient Method, The Modified Newton Method, and The 
Combination of Steepest Descent and Newton’s Method, when creatively applied, 
all yield methods that converge at approximately the canonical rate associated with 
the original constrained problem. 

It is necessary to add a point of qualification with respect to some of the 
algorithms introduced in this chapter, lest it be inferred that they are offered as 
panaceas for the general programming problem. As has been repeatedly emphasized, 
the ideal study of convergence is a careful blend of analysis, good sense, and 
experimentation. The rate of convergence does not always tell the whole story, 
although it is often a major component of it. Although some of the algorithms 
presented in this chapter asymptotically achieve the canonical rate of convergence 
(at least approximately), for large c the points may have to be quite close to the 
solution before this rate characterizes the process. In other words, for large c the
process may converge slowly in its initial phase, and, to obtain a truly representative 
analysis, one must look beyond the first-order convergence properties of these 
methods. For this reason many people find Newton’s method attractive, although 
the work at each step can be substantial. 


13.10 EXERCISES 


1. Show that if q(c, x) is continuous (with respect to x) and $q(c, x) \to \infty$ as $|x| \to \infty$,
then q(c, x) has a minimum.


2. Suppose problem (1), with f continuous, is approximated by the penalty problem (2),
and let $\{c_k\}$ be an increasing sequence of positive constants tending to infinity. Define
q(c, x) = f(x) + cP(x), and fix $\varepsilon > 0$. For each k let $x_k$ be determined satisfying

$$q(c_k, x_k) \le [\min_x q(c_k, x)] + \varepsilon.$$

Show that if x* is a solution to (1), any limit point, $\bar{x}$, of the sequence $\{x_k\}$ is feasible
and satisfies $f(\bar{x}) \le f(x^*) + \varepsilon$.


3. Construct an example problem and a penalty function such that, as $c \to \infty$, the solution
to the penalty problem diverges to infinity. 


4. Combined penalty and barrier method. Consider a problem of the form

minimize    f(x)
subject to  $x \in S \cap T$

and suppose P is a penalty function for S and B is a barrier function for T. Define

$$d(c, x) = f(x) + cP(x) + \frac{1}{c} B(x).$$

Let $\{c_k\}$ be a sequence $c_k \to \infty$, and for k = 1, 2, ... let $x_k$ be a solution to

minimize    $d(c_k, x)$

subject to $x \in$ interior of T. Assume all functions are continuous, T is compact (and
robust), the original problem has a solution x*, and that $S \cap$ [interior of T] is not empty.
Show that

a) $\lim_{k \to \infty} d(c_k, x_k) = f(x^*)$.

b) $\lim_{k \to \infty} c_k P(x_k) = 0$.

c) $\lim_{k \to \infty} \frac{1}{c_k} B(x_k) = 0$.


5. Prove the Theorem at the end of Section 13.2. 


6. Find the central path for the problem of minimizing $x^2$ subject to $x \ge 0$.


7. Consider a penalty function for the equality constraints

$$h(x) = 0, \qquad h(x) \in E^m,$$

having the form

$$P(x) = \gamma(h(x)) = \sum_{i=1}^{m} w(h_i(x)),$$

where w is a function whose derivative w' is analytic and has a zero of order $s \ge 1$ at
zero.


a) Show that corresponding to (26) we have

$$Q(c_k, x_k) = L(x_k) + c_k \sum_{i=1}^{m} w''(h_i(x_k)) \nabla h_i(x_k)^T \nabla h_i(x_k).$$

b) Show that as $c_k \to \infty$, m eigenvalues of $Q(c_k, x_k)$ have magnitude on the order of
$(c_k)^{1/s}$.


8. Corresponding to the problem

minimize    f(x)
subject to  g(x) ≤ 0,

consider the sequence of unconstrained problems

minimize    $f(x) + [g^+(x) + 1]^k - 1$,

and suppose $x_k$ is the solution to the kth problem.


a) Find an appropriate definition of a Lagrange multiplier $\lambda_k$ to associate with $x_k$.
b) Find the limiting form of the Hessian of the associated objective function, and
determine how fast the largest eigenvalues tend to infinity.


9. Repeat Exercise 8 for the sequence of unconstrained problems

minimize    $f(x) + [(g(x) + 1)^+]^k$.
10. Morrison's method. Suppose the problem

minimize    f(x)
subject to  h(x) = 0                                        (70)

has solution x*. Let M be an optimistic estimate of f(x*), that is, $M \le f(x^*)$. Define
$v(M, x) = [f(x) - M]^2 + |h(x)|^2$ and define the unconstrained problem

minimize    v(M, x).                                        (71)

Given $M_k \le f(x^*)$, a solution $x_{M_k}$ to the corresponding problem (71) is found, then $M_k$
is updated through

$$M_{k+1} = M_k + [v(M_k, x_{M_k})]^{1/2} \tag{72}$$

and the process repeated.


a) Show that if M = f(x*), a solution to (71) is a solution to (70).

b) Show that if $x_M$ is a solution to (71), then $f(x_M) \le f(x^*)$.

c) Show that if $M_k \le f(x^*)$ then $M_{k+1}$ determined by (72) satisfies $M_{k+1} \le f(x^*)$.

d) Show that $M_k \to f(x^*)$.

e) Find the Hessian of v(M, x) (with respect to x). Show that, to within a scale factor,
it is identical to that associated with the standard penalty function method.


11. Let A be an m x n matrix of rank m. Prove the matrix identity

$$[I + A^T A]^{-1} = I - A^T [I + A A^T]^{-1} A$$

and discuss how it can be used in conjunction with the method of Section 13.4.


12. Show that in the limit of large c, a single cycle of the normalization method of 
Section 13.6 is exactly the same as a single cycle of the combined penalty function and 
gradient projection method of Section 13.7. 


13. Suppose that at some step k of the combined penalty function and gradient projection 
method, the m x n matrix Vh(x,) is not of rank m. Show how the method can be 
continued by temporarily executing the Newton step over a subspace of dimension less 
than m. 


14. For a problem with equality constraints, show that in the combined penalty function 
and gradient projection method the second step (the steepest descent step) can be 
replaced by a step in the direction of the negative projected gradient (projected onto
$M_k$) without destroying the global convergence property and without changing the rate
of convergence. 


15. Develop a method that is analogous to that of Section 13.7, but which is a combination 
of penalty functions and the reduced gradient method. Establish that the rate of 
convergence of the method is identical to that of the reduced gradient method. 


16. Extend the result of the Exact Penalty Theorem of Section 13.8 to inequalities. Write
$g_j(x) \le 0$ in the form of an equality as $g_j(x) + y_j^2 = 0$ and show that the original
theorem applies.


17. Develop a result analogous to that of the Exact Penalty Theorem of Section 13.8 for the
penalty function

$$P(x) = \max\{0, g_1(x), g_2(x), \ldots, g_p(x), |h_1(x)|, |h_2(x)|, \ldots, |h_m(x)|\}.$$

18. Solve the problem

minimize    $x^2 + xy + y^2 - 2y$
subject to  $x + y = 2$

three ways analytically

a) with the necessary conditions.
b) with a quadratic penalty function.
c) with an exact penalty function.
Chapter 14 DUAL AND CUTTING PLANE METHODS


Dual methods are based on the viewpoint that it is the Lagrange multipliers which 
are the fundamental unknowns associated with a constrained problem; once these 
multipliers are known determination of the solution point is simple (at least in 
some situations). Dual methods, therefore, do not attack the original constrained 
problem directly but instead attack an alternate problem, the dual problem, whose 
unknowns are the Lagrange multipliers of the first problem. For a problem with n 
variables and m equality constraints, dual methods thus work in the m-dimensional 
space of Lagrange multipliers. Because Lagrange multipliers measure sensitivities 
and hence often have meaningful intuitive interpretations as prices associated with 
constraint resources, searching for these multipliers is often, in the context of a
given practical problem, as appealing as searching for the values of the original 
problem variables. 

The study of dual methods, and more particularly the introduction of the dual 
problem, precipitates some extensions of earlier concepts. Thus, perhaps the most 
interesting feature of this chapter is the calculation of the Hessian of the dual problem 
and the discovery of a dual canonical convergence ratio associated with a cons- 
trained problem that governs the convergence of steepest ascent applied to the dual. 

Cutting plane algorithms, exceedingly elementary in principle, develop a series 
of ever-improving approximating linear programs, whose solutions converge to the 
solution of the original problem. The methods differ only in the manner by which an 
improved approximating problem is constructed once a solution to the old approx- 
imation is known. The theory associated with these algorithms is, unfortunately, 
scant and their convergence properties are not particularly attractive. They are, 
however, often very easy to implement. 


14.1 GLOBAL DUALITY 


Duality in nonlinear programming takes its most elegant form when it is formu- 
lated globally in terms of sets and hyperplanes that touch those sets. This theory 
makes clear the role of Lagrange multipliers as defining hyperplanes which can be 
considered as dual to points in a vector space. The theory provides a symmetry
between primal and dual problems and this symmetry can be considered as perfect 
for convex problems. For non-convex problems the “imperfection” is made clear 
by the duality gap which has a simple geometric interpretation. The global theory, 
which is presented in this section, serves as useful background when later we 
specialize to a local duality theory that can be used even without convexity and 
which is central to the understanding of the convergence of dual algorithms. 

As a counterpoint to Section 11.9 where equality constraints were considered 
before inequality constraints, here we shall first consider a problem with inequality 
constraints. In particular, consider the problem 


minimize    f(x)                                            (1)
subject to  g(x) ≤ 0
            $x \in \Omega$.


$\Omega \subset E^n$ is a convex set, and the functions f and g are defined on $\Omega$. The function g
is p-dimensional. The problem is not necessarily convex, but we assume that there
is a feasible point. Recall that the primal function associated with (1) is defined for
$z \in E^p$ as

$$\omega(z) = \inf\{ f(x) : g(x) \le z,\ x \in \Omega \}, \tag{2}$$

defined by letting the right-hand side of the inequality constraint take on arbitrary
values. It is understood that (2) is defined on the set $D = \{z : g(x) \le z$, for some
$x \in \Omega\}$.

If problem (1) has a solution x* with value f* = f(x*), then f* is the point on
the vertical axis in $E^{p+1}$ where the primal function passes through the axis. If (1)
does not have a solution, then $f^* = \inf\{ f(x) : g(x) \le 0,\ x \in \Omega \}$ is the intersection
point.

The duality principle is derived from consideration of all hyperplanes that lie
below the primal function. As illustrated in Fig. 14.1 the intercept with the vertical
axis of such a hyperplane lies below (or at) the value f*.


Fig. 14.1 Hyperplane below $\omega(z)$
To express this property we define the dual function defined on the positive
cone in $E^p$ as

$$\phi(\mu) = \inf\{ f(x) + \mu^T g(x) : x \in \Omega \}. \tag{3}$$

In general, $\phi$ may not be finite throughout the positive orthant $E^p_+$, but the region
where it is finite is convex.


Proposition 1. The dual function is concave on the region where it is finite.


Proof. Suppose $\mu_1$, $\mu_2$ are in the finite region, and let $0 \le \alpha \le 1$. Then

$$\begin{aligned}
\phi(\alpha \mu_1 + (1 - \alpha)\mu_2)
 &= \inf\{ f(x) + (\alpha \mu_1 + (1 - \alpha)\mu_2)^T g(x) : x \in \Omega \} \\
 &\ge \inf\{ \alpha f(x_1) + \alpha \mu_1^T g(x_1) : x_1 \in \Omega \} \\
 &\quad + \inf\{ (1 - \alpha) f(x_2) + (1 - \alpha)\mu_2^T g(x_2) : x_2 \in \Omega \} \\
 &= \alpha \phi(\mu_1) + (1 - \alpha)\phi(\mu_2). \qquad ∎
\end{aligned}$$


We define $\phi^* = \sup\{\phi(\mu) : \mu \ge 0\}$ where it is understood that the supremum
is taken over the region where $\phi$ is finite. We can now state the weak form of
global duality.


Weak Duality Proposition. $\phi^* \le f^*$.


Proof. For every $\mu \ge 0$ we have

$$\begin{aligned}
\phi(\mu) &= \inf\{ f(x) + \mu^T g(x) : x \in \Omega \} \\
 &\le \inf\{ f(x) + \mu^T g(x) : g(x) \le 0,\ x \in \Omega \} \\
 &\le \inf\{ f(x) : g(x) \le 0,\ x \in \Omega \} = f^*.
\end{aligned}$$

Taking the supremum over the left-hand side gives $\phi^* \le f^*$. ∎


Hence the dual function gives lower bounds on the optimal value f*.

This dual function has a strong geometric interpretation. Consider a (p + 1)-
dimensional vector $(1, \mu) \in E^{p+1}$ with $\mu \ge 0$ and a constant c. The set of vectors
(r, z) such that the inner product $(1, \mu)^T (r, z) \equiv r + \mu^T z = c$ defines a hyperplane
in $E^{p+1}$. Different values of c give different hyperplanes, all of which are parallel.

For a given $(1, \mu)$ we consider the lowest possible hyperplane of this form that
just barely touches (supports) the region above the primal function of problem (1).
Suppose $x_1$ defines the touching point with values $r = f(x_1)$ and $z = g(x_1)$. Then
$c = f(x_1) + \mu^T g(x_1) = \phi(\mu)$.

The hyperplane intersects the vertical axis at a point of the form $(r_0, 0)$. This
point also must satisfy $(1, \mu)^T (r_0, 0) = c = \phi(\mu)$. This gives $c = r_0$. Thus the
intercept gives $\phi(\mu)$ directly. Thus the dual function at $\mu$ is equal to the intercept
of the hyperplane defined by $\mu$ that just touches the epigraph of the primal function.
[Figure: the highest hyperplane below the primal function, with f*, $\phi^*$ and the
duality gap marked on the vertical axis]

Fig. 14.2 The highest hyperplane


Furthermore, this intercept (and dual function value) is maximized by the 
Lagrange multiplier which corresponds to the largest possible intercept, at a point 
no higher than the optimal value f*. See Fig. 14.2. 

By introducing convexity assumptions, the foregoing analysis can be 
strengthened to give the strong duality theorem, with no duality gap when the 
intercept is at f*. See Fig. 14.3. 

We shall state the result for the more general problem that includes equality 
constraints of the form h(x) = 0, as in Section 11.9. 

Specifically, we consider the problem 


minimize    f(x)                                            (4)
subject to  h(x) = 0, g(x) ≤ 0
            $x \in \Omega$


where h is affine of dimension m, g is convex of dimension p, and $\Omega$ is a convex
set.


[Figure: the optimal hyperplane touching the primal function at the vertical axis]

Fig. 14.3 The strong duality theorem. There is no duality gap
In this case the dual function is

$$\phi(\lambda, \mu) = \inf\{ f(x) + \lambda^T h(x) + \mu^T g(x) : x \in \Omega \},$$

and

$$\phi^* = \sup\{ \phi(\lambda, \mu) : \lambda \in E^m,\ \mu \in E^p,\ \mu \ge 0 \}.$$


Strong Duality Theorem. Suppose in the problem (4), h is regular with respect
to $\Omega$ and there is a point $x_1 \in \Omega$ with $h(x_1) = 0$ and $g(x_1) < 0$.

Suppose the problem has solution x* with value f(x*) = f*. Then for every
$\lambda$ and every $\mu \ge 0$ there holds

$$\phi(\lambda, \mu) \le f^*.$$

Furthermore, there are $\lambda$, $\mu \ge 0$ such that

$$\phi(\lambda, \mu) = f^*$$

and hence $\phi^* = f^*$. Moreover, the $\lambda$ and $\mu$ above are Lagrange multipliers
for the problem.


Proof. The proof follows almost immediately from the Zero-order Lagrange
Theorem of Section 11.9. The Lagrange multipliers of that theorem give

$$\begin{aligned}
f^* &= \min\{ f(x) + \lambda^T h(x) + \mu^T g(x) : x \in \Omega \} \\
    &= \phi(\lambda, \mu) \le \phi^* \le f^*.
\end{aligned}$$

Equality must hold across the inequalities, which establishes the results. ∎


As a nice summary we can place the primal and dual problems together.

Primal:   $f^* = \min\ \omega(z)$       subject to  $z \le 0$

Dual:     $\phi^* = \max\ \phi(\mu)$    subject to  $\mu \ge 0$.


Example 1. (Quadratic program). Consider the problem

$$
\begin{aligned}
\text{minimize}\quad & \tfrac{1}{2} x^T Q x \\
\text{subject to}\quad & Bx - b \le 0.
\end{aligned}
\tag{5}
$$

The dual function is

$$
\phi(\mu) = \min_x \left\{ \tfrac{1}{2} x^T Q x + \mu^T (Bx - b) \right\}.
$$

This gives the necessary conditions

$$
Qx + B^T \mu = 0,
$$

and hence $x = -Q^{-1} B^T \mu$. Substituting this into $\phi(\mu)$ gives

$$
\phi(\mu) = -\tfrac{1}{2} \mu^T B Q^{-1} B^T \mu - \mu^T b.
$$

Hence the dual problem is

$$
\begin{aligned}
\text{maximize}\quad & -\tfrac{1}{2} \mu^T B Q^{-1} B^T \mu - \mu^T b \\
\text{subject to}\quad & \mu \ge 0,
\end{aligned}
\tag{6}
$$

which is also a quadratic programming problem. If this problem is solved for $\mu$,
that will be the Lagrange multiplier for the primal problem (5).

Note that the first-order conditions for the dual problem (6) imply

$$
\mu^T [-B Q^{-1} B^T \mu - b] = 0,
$$

which by substituting the formula for $x$ is equivalent to

$$
\mu^T [Bx - b] = 0.
$$

This is the complementary slackness condition for the original (primal) problem (5).

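
The quadratic-program duality above can be checked numerically on a small instance. The following is a minimal sketch, not a general solver: it assumes a diagonal positive definite Q (passed as `q_diag`), and the problem data, the step size, and the function names are all illustrative choices rather than anything taken from the text. It maximizes the dual (6) by projected gradient ascent, using the fact that the gradient of the dual function is Bx(mu) - b, and then recovers x from mu.

```python
def primal_from_dual(B, q_diag, mu):
    """x = -Q^{-1} B^T mu for a diagonal Q given by its diagonal q_diag."""
    n = len(q_diag)
    return [-sum(B[i][j] * mu[i] for i in range(len(mu))) / q_diag[j]
            for j in range(n)]

def constraint_values(B, b, x):
    """Bx - b, one entry per inequality constraint."""
    return [sum(r * xj for r, xj in zip(row, x)) - bi for row, bi in zip(B, b)]

def solve_qp_dual(B, b, q_diag, step=0.3, iters=500):
    """Projected gradient ascent on phi(mu) = -1/2 mu^T B Q^-1 B^T mu - mu^T b, mu >= 0."""
    mu = [0.0] * len(b)
    for _ in range(iters):
        x = primal_from_dual(B, q_diag, mu)
        grad = constraint_values(B, b, x)          # gradient of phi is Bx(mu) - b
        mu = [max(0.0, m + step * g) for m, g in zip(mu, grad)]
    return mu

# Illustrative data: minimize 1/2 (x1^2 + x2^2)
# subject to x1 + x2 >= 1 (written as -x1 - x2 + 1 <= 0) and x1 <= 5.
B = [[-1.0, -1.0], [1.0, 0.0]]
b = [-1.0, 5.0]
q_diag = [1.0, 1.0]

mu = solve_qp_dual(B, b, q_diag)
x = primal_from_dual(B, q_diag, mu)
slack = constraint_values(B, b, x)
comp = sum(m * s for m, s in zip(mu, slack))       # mu^T (Bx - b)
print(mu, x, comp)
```

In this instance only the first constraint is active, so its multiplier is positive while the second multiplier stays at zero, and the product mu^T(Bx - b) vanishes as the complementary slackness condition requires.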
Example 4 (Integer solutions). Duality gaps may arise if the objective function or
the constraint functions are not convex. A gap may also arise if the underlying
set is not convex. This is characteristic, for example, of problems in which the
components of the solution vector are constrained to be integers. For instance,
consider the problem

$$
\begin{aligned}
\text{minimize}\quad & x_1^2 + 2x_2^2 \\
\text{subject to}\quad & x_1 + x_2 \ge 1/2 \\
& x_1,\ x_2 \ \text{nonnegative integers.}
\end{aligned}
$$

It is clear that the solution is $x_1 = 1$, $x_2 = 0$, with objective value $f^* = 1$. To put
this problem in the standard form we have discussed, we write the constraint as

$$
-x_1 - x_2 + 1/2 \le z, \quad \text{where } z = 0.
$$

The primal function $\omega(z)$ is equal to 0 for $z \ge 1/2$ since then $x_1 = x_2 = 0$ is feasible.
The entire primal function has steps as $z$ steps negatively integer by integer, as
shown in Fig. 14.4.

Fig. 14.4 Duality for an integer problem

The dual function is

$$
\phi(\mu) = \min \left\{ x_1^2 + 2x_2^2 - \mu (x_1 + x_2 - 1/2) \right\},
$$

where the minimum is taken with respect to the integer constraint. Analytically,
the solution for small values of $\mu$ is

$$
\phi(\mu) =
\begin{cases}
\mu/2 & \text{for } 0 \le \mu \le 1, \\
1 - \mu/2 & \text{for } 1 \le \mu \le 2,
\end{cases}
$$

and so forth.

The maximum value of $\phi(\mu)$ is the maximum intercept of the corresponding
hyperplanes (lines, in this case) with the vertical axis. This occurs for $\mu = 1$ with
a corresponding value of $\phi^* = \phi(1) = 1/2$. We have $\phi^* < f^*$ and the difference
$f^* - \phi^* = 1/2$ is the duality gap.
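
The gap in this integer example is small enough to confirm by brute force. The sketch below assumes the objective $x_1^2 + 2x_2^2$ and enumerates nonnegative integers up to an arbitrary bound (`max_int`, an illustrative parameter that is large enough for the range of multipliers scanned); the grid of multiplier values is likewise an assumption made for illustration.

```python
def dual_value(mu, max_int=10):
    """phi(mu): minimum over nonnegative integers of x1^2 + 2*x2^2 - mu*(x1 + x2 - 1/2)."""
    return min(x1 * x1 + 2 * x2 * x2 - mu * (x1 + x2 - 0.5)
               for x1 in range(max_int + 1)
               for x2 in range(max_int + 1))

def primal_value(max_int=10):
    """f*: minimum of x1^2 + 2*x2^2 over nonnegative integers with x1 + x2 >= 1/2."""
    return min(x1 * x1 + 2 * x2 * x2
               for x1 in range(max_int + 1)
               for x2 in range(max_int + 1)
               if x1 + x2 >= 0.5)

# Scan mu over [0, 4] in steps of 0.01 and keep the best dual value.
grid = [k / 100 for k in range(0, 401)]
best_mu = max(grid, key=dual_value)
f_star = primal_value()
phi_star = dual_value(best_mu)
gap = f_star - phi_star
print(f_star, best_mu, phi_star, gap)
```

The scan is only a check on the piecewise-linear formula; it does not replace the analytic argument, and a finer or wider grid would be needed for other problem data.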


14.2 LOCAL DUALITY 


In practice the mechanics of duality are frequently carried out locally, by setting 
derivatives to zero, or moving in the direction of a gradient. For these operations 
the beautiful global theory can in large measure be replaced by a weaker but often 
more useful local theory. This theory requires a minimum of convexity assumptions 
defined locally. We present such a theory in this section, since it is in keeping 
with the spirit of the earlier chapters and is perhaps the simplest way to develop 
computationally useful duality results. 

As often done before for convenience, we again consider nonlinear 
programming problems of the form 


$$
\begin{aligned}
\text{minimize}\quad & f(x) \\
\text{subject to}\quad & h(x) = 0,
\end{aligned}
\tag{7}
$$

where $x \in E^n$, $h(x) \in E^m$ and $f, h \in C^2$. Global convexity is not assumed here.
Everything we do can be easily extended to problems having inequality as well as
equality constraints, for the price of a somewhat more involved notation.

We focus attention on a local solution $x^*$ of (7). Assuming that $x^*$ is a regular
point of the constraints, then, as we know, there will be a corresponding Lagrange
multiplier (row) vector $\lambda^*$ such that

$$
\nabla f(x^*) + (\lambda^*)^T \nabla h(x^*) = 0, \tag{8}
$$

and the Hessian of the Lagrangian

$$
L(x^*) = F(x^*) + (\lambda^*)^T H(x^*) \tag{9}
$$

must be positive semidefinite on the tangent subspace

$$
M = \{ x : \nabla h(x^*) x = 0 \}.
$$

At this point we introduce the special local convexity assumption necessary
for the development of the local duality theory. Specifically, we assume that the
Hessian $L(x^*)$ is positive definite. Of course, it should be emphasized that by this we
mean $L(x^*)$ is positive definite on the whole space $E^n$, not just on the subspace $M$.
The assumption guarantees that the Lagrangian $l(x) = f(x) + (\lambda^*)^T h(x)$ is locally
convex at $x^*$.

With this assumption, the point $x^*$ is not only a local solution to the constrained
problem (7); it is also a local solution to the unconstrained problem

$$
\text{minimize}\quad f(x) + (\lambda^*)^T h(x), \tag{10}
$$

since it satisfies the first- and second-order sufficiency conditions for a local
minimum point. Furthermore, for any $\lambda$ sufficiently close to $\lambda^*$ the function
$f(x) + \lambda^T h(x)$ will have a local minimum point at a point $x$ near $x^*$. This follows
by noting that, by the Implicit Function Theorem, the equation

$$
\nabla f(x) + \lambda^T \nabla h(x) = 0 \tag{11}
$$

has a solution $x$ near $x^*$ when $\lambda$ is near $\lambda^*$, because $L^*$ is nonsingular; and by the
fact that, at this solution $x$, the Hessian $F(x) + \lambda^T H(x)$ is positive definite. Thus
locally there is a unique correspondence between $\lambda$ and $x$ through solution of the
unconstrained problem

$$
\text{minimize}\quad f(x) + \lambda^T h(x). \tag{12}
$$

Furthermore, this correspondence is continuously differentiable.

Near $\lambda^*$ we define the dual function $\phi$ by the equation

$$
\phi(\lambda) = \text{minimum}\ [ f(x) + \lambda^T h(x) ], \tag{13}
$$




where here it is understood that the minimum is taken locally with respect to $x$
near $x^*$. We are then able to show (and will do so below) that locally the original
constrained problem (7) is equivalent to unconstrained local maximization of the
dual function $\phi$ with respect to $\lambda$. Hence we establish an equivalence between a
constrained problem in $x$ and an unconstrained problem in $\lambda$.

To establish the duality relation we must prove two important lemmas. In the
statements below we denote by $x(\lambda)$ the unique solution to (12) in the neighborhood
of $x^*$.


Lemma 1. The dual function $\phi$ has gradient

$$
\nabla \phi(\lambda) = h(x(\lambda))^T. \tag{14}
$$

Proof. We have explicitly, from (13),

$$
\phi(\lambda) = f(x(\lambda)) + \lambda^T h(x(\lambda)).
$$

Thus

$$
\nabla \phi(\lambda) = [\nabla f(x(\lambda)) + \lambda^T \nabla h(x(\lambda))] \nabla x(\lambda) + h(x(\lambda))^T.
$$

Since the first term on the right vanishes by definition of $x(\lambda)$, we obtain (14).

Lemma 1 is of extreme practical importance, since it shows that the gradient of
the dual function is simple to calculate. Once the dual function itself is evaluated,
by minimization with respect to $x$, the corresponding $h(x)^T$, which is the gradient,
can be evaluated without further calculation.

The Hessian of the dual function can be expressed in terms of the Hessian of
the Lagrangian. We use the notation $L(x, \lambda) = F(x) + \lambda^T H(x)$, explicitly indicating
the dependence on $\lambda$. (We continue to use $L(x^*)$ when $\lambda = \lambda^*$ is understood.) We
then have the following lemma.

Lemma 2. The Hessian of the dual function is

$$
\Phi(\lambda) = -\nabla h(x(\lambda)) L^{-1}(x(\lambda), \lambda) \nabla h(x(\lambda))^T. \tag{15}
$$

Proof. The Hessian is the derivative of the gradient. Thus, by Lemma 1,

$$
\Phi(\lambda) = \nabla h(x(\lambda)) \nabla x(\lambda). \tag{16}
$$

By definition we have

$$
\nabla f(x(\lambda)) + \lambda^T \nabla h(x(\lambda)) = 0,
$$

and differentiating this with respect to $\lambda$ we obtain

$$
L(x(\lambda), \lambda) \nabla x(\lambda) + \nabla h(x(\lambda))^T = 0.
$$

Solving for $\nabla x(\lambda)$ and substituting in (16) we obtain (15).

Since $L^{-1}(x(\lambda), \lambda)$ is positive definite, and since $\nabla h(x(\lambda))$ is of full rank near
$x^*$, we have as an immediate consequence of Lemma 2 that the $m \times m$ Hessian of
$\phi$ is negative definite. As might be expected, this Hessian plays a dominant role in
the analysis of dual methods.


Local Duality Theorem. Suppose that the problem

$$
\begin{aligned}
\text{minimize}\quad & f(x) \\
\text{subject to}\quad & h(x) = 0
\end{aligned}
\tag{17}
$$

has a local solution at $x^*$ with corresponding value $r^*$ and Lagrange multiplier
$\lambda^*$. Suppose also that $x^*$ is a regular point of the constraints and that the
corresponding Hessian of the Lagrangian $L^* = L(x^*)$ is positive definite. Then
the dual problem

$$
\text{maximize}\quad \phi(\lambda) \tag{18}
$$

has a local solution at $\lambda^*$ with corresponding value $r^*$ and $x^*$ as the point
corresponding to $\lambda^*$ in the definition of $\phi$.

Proof. It is clear that $x^*$ corresponds to $\lambda^*$ in the definition of $\phi$. Now at $\lambda^*$ we
have by Lemma 1

$$
\nabla \phi(\lambda^*) = h(x^*)^T = 0,
$$

and by Lemma 2 the Hessian of $\phi$ is negative definite. Thus $\lambda^*$ satisfies the first-
and second-order sufficiency conditions for an unconstrained maximum point of $\phi$.
The corresponding value $\phi(\lambda^*)$ is found from the definition of $\phi$ to be $r^*$.


Example 1. Consider the problem in two variables

$$
\begin{aligned}
\text{minimize}\quad & -xy \\
\text{subject to}\quad & (x-3)^2 + y^2 = 5.
\end{aligned}
$$

The first-order necessary conditions are

$$
\begin{aligned}
-y + (2x - 6)\lambda &= 0 \\
-x + 2y\lambda &= 0
\end{aligned}
$$

together with the constraint. These equations have a solution at
$x = 4$, $y = 2$, $\lambda = 1$, as can be checked by direct substitution.

The Hessian of the corresponding Lagrangian is

$$
L = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}.
$$

Since this is positive definite, we conclude that the solution obtained is a local
minimum. (It can be shown, in fact, that it is the global solution.)

Since $L$ is positive definite, we can apply the local duality theory near this
solution. We define

$$
\phi(\lambda) = \min \{ -xy + \lambda [ (x-3)^2 + y^2 - 5 ] \},
$$

which leads to

$$
\phi(\lambda) = \frac{4\lambda + 4\lambda^3 - 80\lambda^5}{(4\lambda^2 - 1)^2},
$$

valid for $\lambda > 1/2$. It can be verified that $\phi$ has a local maximum at $\lambda = 1$.
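
The claims in this example, and Lemma 1 itself, can be spot-checked numerically. The sketch below assumes the minimizer of the Lagrangian in closed form, obtained by solving the two stationarity equations $-y + 2\lambda(x-3) = 0$ and $-x + 2\lambda y = 0$ for $x$ and $y$; the function names and the test points are illustrative. It compares the dual function with the rational expression above and compares a finite-difference derivative of the dual function with the constraint value $h(x(\lambda))$.

```python
def x_of_lambda(lam):
    """Minimizer of -x*y + lam*((x-3)^2 + y^2 - 5); requires lam > 1/2."""
    d = 4 * lam ** 2 - 1
    return 12 * lam ** 2 / d, 6 * lam / d

def h(x, y):
    """Constraint function, h(x, y) = 0 on the feasible set."""
    return (x - 3) ** 2 + y ** 2 - 5

def phi(lam):
    """Dual function evaluated through the inner minimizer."""
    x, y = x_of_lambda(lam)
    return -x * y + lam * h(x, y)

def phi_closed(lam):
    """Closed-form expression for the dual function."""
    return (4 * lam + 4 * lam ** 3 - 80 * lam ** 5) / (4 * lam ** 2 - 1) ** 2

def dphi_numeric(lam, eps=1e-6):
    """Central finite-difference approximation to the derivative of phi."""
    return (phi(lam + eps) - phi(lam - eps)) / (2 * eps)

for lam in (0.8, 1.0, 1.3):
    x, y = x_of_lambda(lam)
    print(lam, phi(lam), phi_closed(lam), dphi_numeric(lam), h(x, y))
```

At $\lambda = 1$ the inner minimizer is $(4, 2)$, the constraint value is zero, and the dual value equals the primal value $-8$; on either side the sign of $h(x(\lambda))$ shows the dual function rising toward and then falling away from $\lambda = 1$.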


Inequality Constraints 


For problems having inequality constraints as well as equality constraints the above 
development requires only minor modification. Consider the problem 


$$
\begin{aligned}
\text{minimize}\quad & f(x) \\
\text{subject to}\quad & h(x) = 0 \\
& g(x) \le 0,
\end{aligned}
\tag{19}
$$

where $g(x) \in E^p$, $g \in C^2$ and everything else is as before. Suppose $x^*$ is a local
solution of (19) and is a regular point of the constraints. Then, as we know, there
are Lagrange multipliers $\lambda^*$ and $\mu^* \ge 0$ such that

$$
\nabla f(x^*) + (\lambda^*)^T \nabla h(x^*) + (\mu^*)^T \nabla g(x^*) = 0 \tag{20}
$$

$$
(\mu^*)^T g(x^*) = 0. \tag{21}
$$

We impose the local convexity assumptions that the Hessian of the Lagrangian

$$
L(x^*) = F(x^*) + (\lambda^*)^T H(x^*) + (\mu^*)^T G(x^*) \tag{22}
$$

is positive definite (on the whole space).

For $\lambda$ and $\mu \ge 0$ near $\lambda^*$ and $\mu^*$ we define the dual function

$$
\phi(\lambda, \mu) = \min\, [ f(x) + \lambda^T h(x) + \mu^T g(x) ], \tag{23}
$$

where the minimum is taken locally near $x^*$. Then, it is easy to show, paralleling
the development above for equality constraints, that $\phi$ achieves a local maximum
with respect to $\lambda$, $\mu \ge 0$ at $\lambda^*$, $\mu^*$.

Partial Duality


It is not necessary to include the Lagrange multipliers of all the constraints of a 
problem in the definition of the dual function. In general, if the local convexity 
assumption holds, local duality can be defined with respect to any subset of 
functional constraints. Thus, for example, in the problem 


$$
\begin{aligned}
\text{minimize}\quad & f(x) \\
\text{subject to}\quad & h(x) = 0 \\
& g(x) \le 0,
\end{aligned}
\tag{24}
$$

we might define the dual function with respect to only the equality constraints. In
this case we would define

$$
\phi(\lambda) = \min \{ f(x) + \lambda^T h(x) \}, \tag{25}
$$

where the minimum is taken locally near the solution $x^*$ but constrained by the
remaining constraints $g(x) \le 0$. Again, the dual function defined in this way will
achieve a local maximum at the optimal Lagrange multiplier $\lambda^*$.


14.3 DUAL CANONICAL CONVERGENCE RATE 


Constrained problems satisfying the local convexity assumption can be solved
by solving the associated unconstrained dual problem, and any of the standard
algorithms discussed in Chapters 7 through 10 can be used for this purpose. Of
course, the method that suggests itself immediately is the method of steepest ascent.
It can be implemented by noting that, according to Lemma 1, Section 14.2, the
gradient of $\phi$ is available almost without cost once $\phi$ itself is evaluated. Without
some special properties, however, the method as a whole can be extremely costly
to execute, since every evaluation of $\phi$ requires the solution of an unconstrained
problem in the unknown $x$. Nevertheless, as shown in the next section, many
important problems do have a structure which is suited to this approach.

The method of steepest ascent, and other gradient-based algorithms, when
applied to the dual problem will have a convergence rate governed by the eigenvalue
structure of the Hessian of the dual function $\phi$. At the Lagrange multiplier $\lambda^*$
corresponding to a solution $x^*$ this Hessian is (according to Lemma 2, Section 14.2)

$$
\Phi = -\nabla h(x^*) (L^*)^{-1} \nabla h(x^*)^T.
$$

This expression shows that $\Phi$ is in some sense a restriction of the matrix $(L^*)^{-1}$
to the subspace spanned by the gradients of the constraint functions, which is
the orthogonal complement of the tangent subspace $M$. This restriction is not the
orthogonal restriction of $(L^*)^{-1}$ onto the complement of $M$ since the particular
representation of the constraints affects the structure of the Hessian. We see, however,
that while the convergence of primal methods is governed by the restriction of $L^*$
to $M$, the convergence of dual methods is governed by a restriction of $(L^*)^{-1}$ to
the orthogonal complement of $M$.

The dual canonical convergence rate associated with the original constrained
problem, which is the rate of convergence of steepest ascent applied to the dual,
is $(B-b)^2/(B+b)^2$ where $b$ and $B$ are, respectively, the smallest and largest
eigenvalues of

$$
-\Phi = \nabla h(x^*) (L^*)^{-1} \nabla h(x^*)^T.
$$

For locally convex programming problems, this rate is as important as the primal
canonical rate.


Scaling 


We conclude this section by pointing out a kind of complementarity that exists 
between the primal and dual rates. Suppose one calculates the primal and dual 
canonical rates associated with the locally convex constrained problem 


$$
\begin{aligned}
\text{minimize}\quad & f(x) \\
\text{subject to}\quad & h(x) = 0.
\end{aligned}
$$

If a change of primal variables $x$ is introduced, the primal rate will in general change
but the dual rate will not. On the other hand, if the constraints are transformed (by
replacing them by $Th(x) = 0$ where $T$ is a nonsingular $m \times m$ matrix), the dual rate
will change but the primal rate will not.


14.4 SEPARABLE PROBLEMS 


A structure that arises frequently in mathematical programming applications is that 
of the separable problem: 


$$
\text{minimize}\quad \sum_{i=1}^{q} f_i(x_i) \tag{26}
$$

$$
\text{subject to}\quad \sum_{i=1}^{q} h_i(x_i) = 0 \tag{27}
$$

$$
\sum_{i=1}^{q} g_i(x_i) \le 0. \tag{28}
$$

In this formulation the components of the $n$-vector $x$ are partitioned into $q$ disjoint
groups, $x = (x_1, x_2, \ldots, x_q)$ where the groups may or may not have the same number
of components. Both the objective function and the constraints separate into sums
of functions of the individual groups. For each $i$, the functions $f_i$, $h_i$, and $g_i$ are
twice continuously differentiable functions of dimensions 1, $m$, and $p$, respectively.


Example 1. Suppose that we have a fixed budget of, say, $A$ dollars that may be
allocated among $n$ activities. If $x_i$ dollars is allocated to the $i$th activity, then there
will be a benefit (measured in some units) of $f_i(x_i)$. To obtain the maximum benefit
within our budget, we solve the separable problem

$$
\begin{aligned}
\text{maximize}\quad & \sum_{i=1}^{n} f_i(x_i) \\
\text{subject to}\quad & \sum_{i=1}^{n} x_i \le A \\
& x_i \ge 0.
\end{aligned}
\tag{29}
$$

In the example $x$ is partitioned into its individual components.


Example 2. Problems involving a series of decisions made at distinct times are 
often separable. For illustration, consider the problem of scheduling water release 
through a dam to produce as much electric power as possible over a given time 
interval while satisfying constraints on acceptable water levels. A discrete-time 
model of this problem is to 


$$
\begin{aligned}
\text{maximize}\quad & \sum_{k=1}^{N} f(y(k), u(k)) \\
\text{subject to}\quad & y(k) = y(k-1) - u(k) + s(k), \quad k = 1, \ldots, N \\
& c \le y(k) \le d, \quad k = 1, \ldots, N \\
& 0 \le u(k), \quad k = 1, \ldots, N.
\end{aligned}
$$

Here $y(k)$ represents the water volume behind the dam at the end of period $k$, $u(k)$
represents the volume flow through the dam during period $k$, and $s(k)$ is the volume
flowing into the lake behind the dam during period $k$ from upper streams. The
function $f$ gives the power generation, and $c$ and $d$ are bounds on lake volume.
The initial volume $y(0)$ is given.

In this example we consider $x$ as the $2N$-dimensional vector of unknowns
$y(k)$, $u(k)$, $k = 1, 2, \ldots, N$. This vector is partitioned into the pairs $x_k =
(y(k), u(k))$. The objective function is then clearly in separable form. The
constraints can be viewed as being in the form (27) with $h_k(x_k)$ having dimension
$N$ and such that $h_k(x_k)$ is identically zero except in the $k$ and $k+1$ components.


Decomposition 


Separable problems are ideally suited to dual methods, because the required
unconstrained minimization decomposes into small subproblems. To see this we recall
that the generally most difficult aspect of a dual method is evaluation of the dual
function. For a separable problem, if we associate $\lambda$ with the equality constraints
(27) and $\mu \ge 0$ with the inequality constraints (28), the required dual function is

$$
\phi(\lambda, \mu) = \min \sum_{i=1}^{q} \left\{ f_i(x_i) + \lambda^T h_i(x_i) + \mu^T g_i(x_i) \right\}.
$$

This minimization problem decomposes into the $q$ separate problems

$$
\min_{x_i}\ f_i(x_i) + \lambda^T h_i(x_i) + \mu^T g_i(x_i).
$$

The solution of these subproblems can usually be accomplished relatively
efficiently, since they are of smaller dimension than the original problem.


Example 3. In Example 1 using duality with respect to the budget constraint, the
$i$th subproblem becomes, for $\mu \ge 0$,

$$
\max_{x_i \ge 0}\ f_i(x_i) - \mu x_i,
$$

which is only a one-dimensional problem. It can be interpreted as setting a benefit
value $\mu$ for dollars and then maximizing total benefit from activity $i$, accounting
for the dollar expenditure.
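
The decomposition of the budget problem can be illustrated with a short computation. The sketch below is only an illustration under stated assumptions: the benefit functions are taken to be $f_i(x_i) = a_i \log(1 + x_i)$ (a hypothetical concave choice, not from the text), so that each one-dimensional subproblem has the closed-form solution $x_i = \max(0, a_i/\mu - 1)$, and the price $\mu$ is adjusted by bisection until the total expenditure matches the budget.

```python
def subproblem(a_i, mu):
    """Maximize a_i*log(1 + x) - mu*x over x >= 0 (closed form for this assumed benefit)."""
    return max(0.0, a_i / mu - 1.0)

def allocate(a, budget, tol=1e-10):
    """Find the price mu at which the independent subproblems spend exactly the budget."""
    lo, hi = 1e-9, max(a)          # at mu = max(a) every activity receives nothing
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        spend = sum(subproblem(a_i, mu) for a_i in a)
        if spend > budget:
            lo = mu                # dollars are priced too low: raise the price
        else:
            hi = mu                # dollars are priced too high: lower the price
    mu = 0.5 * (lo + hi)
    return mu, [subproblem(a_i, mu) for a_i in a]

# Illustrative data: three activities and a budget of 4 dollars.
mu, x = allocate([4.0, 2.0, 1.0], budget=4.0)
print(mu, x, sum(x))
```

Each evaluation of the total expenditure solves the activities independently, which is the point of the decomposition; only the scalar price couples them. For these data the price settles at 1 and the allocation at (3, 1, 0).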


Example 4. In Example 2 using duality with respect to the equality constraints we
denote the dual variables by $\lambda(k)$, $k = 1, 2, \ldots, N$. The $k$th subproblem becomes

$$
\max_{\substack{c \le y(k) \le d \\ 0 \le u(k)}}
\left\{ f(y(k), u(k)) + [\lambda(k+1) - \lambda(k)]\, y(k) - \lambda(k) [u(k) - s(k)] \right\},
$$

which is a two-dimensional optimization problem. Selection of $\lambda \in E^N$ decomposes
the problem into separate problems for each time period. The variable $\lambda(k)$ can be
regarded as a value, measured in units of power, for water at the beginning of period
$k$. The $k$th subproblem can then be interpreted as that faced by an entrepreneur who
leased the dam for one period. He can buy water for the dam at the beginning of
the period at price $\lambda(k)$ and sell what he has left at the end of the period at price
$\lambda(k+1)$. His problem is to determine $y(k)$ and $u(k)$ so that his net profit, accruing
from sale of generated power and purchase and sale of water, is maximized.


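Example 5. (The hanging chain). Consider again the problem of finding the
equilibrium position of the hanging chain considered in Example 4, Section 11.3,
and Example 1, Section 12.7. The problem is

$$
\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{n} c_i y_i \\
\text{subject to}\quad & \sum_{i=1}^{n} y_i = 0 \\
& \sum_{i=1}^{n} \sqrt{1 - y_i^2} = L,
\end{aligned}
$$

where $c_i = n - i + \tfrac{1}{2}$, $L = 16$. This problem is locally convex, since as shown in
Section 12.7 the Hessian of the Lagrangian is positive definite. The dual function
is accordingly

$$
\phi(\lambda, \mu) = \min \sum_{i=1}^{n} \left\{ c_i y_i + \lambda y_i + \mu \sqrt{1 - y_i^2} \right\} - L\mu.
$$

Since the problem is separable, the minimization divides into a separate
minimization for each $y_i$, yielding the equations

$$
c_i + \lambda - \frac{\mu y_i}{\sqrt{1 - y_i^2}} = 0
$$

or

$$
(c_i + \lambda)^2 (1 - y_i^2) = \mu^2 y_i^2.
$$

This yields

$$
y_i = \frac{-(c_i + \lambda)}{[(c_i + \lambda)^2 + \mu^2]^{1/2}}. \tag{30}
$$

The above represents a local minimum point provided $\mu < 0$; and the minus sign
must be taken for consistency.

The dual function is then

$$
\phi(\lambda, \mu) = \sum_{i=1}^{n} \left\{ \frac{-(c_i + \lambda)^2}{[(c_i + \lambda)^2 + \mu^2]^{1/2}}
+ \mu \left[ \frac{\mu^2}{(c_i + \lambda)^2 + \mu^2} \right]^{1/2} \right\} - L\mu
$$

or finally, using $\sqrt{\mu^2} = -\mu$ for $\mu < 0$,

$$
\phi(\lambda, \mu) = -L\mu - \sum_{i=1}^{n} \left[ (c_i + \lambda)^2 + \mu^2 \right]^{1/2}.
$$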

The correct values of $\lambda$ and $\mu$ can be found by maximizing $\phi(\lambda, \mu)$. One way to do
this is to use steepest ascent. The results of this calculation, starting at $\lambda = \mu = 0$,
are shown in Table 14.1. The values of $y_i$ can then be found from (30).

Table 14.1 Results of Dual of Chain Problem

Iteration   Value
0           -200.00000
1           -66.94638
2           -66.61959
3           -66.55867
4           -66.54845
5           -66.54683
6           -66.54658
7           -66.54654
8           -66.54653
9           -66.54653

Final solution
$\lambda$ = -10.00048
$\mu$ = -6.761136
$y_1$ = -.8147154
$y_2$ = -.7825940
$y_3$ = -.7427243
$y_4$ = -.6930215
$y_5$ = -.6310140
$y_6$ = -.5540263
$y_7$ = -.4596696
$y_8$ = -.3467526
$y_9$ = -.2165239
$y_{10}$ = -.0736802
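
The closed-form dual function of the chain problem is easy to evaluate, which gives a way to cross-check the table. The sketch below assumes $n = 20$ links; that value is not stated in this example and is inferred from the starting value $-200$ in the table, which equals $-\sum_i c_i$ for $n = 20$. It evaluates the dual function, its gradient (the constraint residuals, by Lemma 1 of Section 14.2), and the $y_i$ from (30) at the multipliers reported in the table. It does not reproduce the steepest ascent iterations themselves.

```python
import math

N, L = 20, 16.0                        # N = 20 is an assumption inferred from the table
c = [N - i + 0.5 for i in range(1, N + 1)]

def y_of(lam, mu):
    """Link values from (30): y_i = -(c_i + lam) / sqrt((c_i + lam)^2 + mu^2)."""
    return [-(ci + lam) / math.hypot(ci + lam, mu) for ci in c]

def phi(lam, mu):
    """Dual function: -L*mu - sum_i sqrt((c_i + lam)^2 + mu^2), derived for mu < 0."""
    return -L * mu - sum(math.hypot(ci + lam, mu) for ci in c)

def grad_phi(lam, mu):
    """Gradient of phi; for mu < 0 its components are the two constraint residuals."""
    g_lam = sum(y_of(lam, mu))                                   # sum of y_i
    g_mu = -L - sum(mu / math.hypot(ci + lam, mu) for ci in c)   # sum sqrt(1 - y_i^2) - L
    return g_lam, g_mu

lam_r, mu_r = -10.00048, -6.761136     # multipliers reported in Table 14.1
print(phi(0.0, 0.0))                   # starting value of the ascent
print(phi(lam_r, mu_r), grad_phi(lam_r, mu_r))
print(y_of(lam_r, mu_r)[:10])
```

Because the gradient of the dual function is the vector of constraint residuals, a steepest ascent step needs nothing beyond the quantities computed here; the small residuals at the reported multipliers indicate that both constraints are nearly satisfied there.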


14.5 AUGMENTED LAGRANGIANS 


One of the most effective general classes of nonlinear programming methods is 
the augmented Lagrangian methods, alternatively referred to as multiplier methods. 
These methods can be viewed as a combination of penalty functions and local duality 
methods; the two concepts work together to eliminate many of the disadvantages 
associated with either method alone. 

The augmented Lagrangian for the equality constrained problem 


$$
\begin{aligned}
\text{minimize}\quad & f(x) \\
\text{subject to}\quad & h(x) = 0
\end{aligned}
\tag{31}
$$

is the function

$$
l_c(x, \lambda) = f(x) + \lambda^T h(x) + \tfrac{1}{2} c |h(x)|^2
$$

for some positive constant $c$. We shall briefly indicate how the augmented
Lagrangian can be viewed as either a special penalty function or as the basis for a
dual problem. These two viewpoints are then explored further in this and the next
section.

From a penalty function viewpoint the augmented Lagrangian, for a fixed value
of the vector $\lambda$, is simply the standard quadratic penalty function for the problem

$$
\begin{aligned}
\text{minimize}\quad & f(x) + \lambda^T h(x) \\
\text{subject to}\quad & h(x) = 0.
\end{aligned}
\tag{32}
$$

This problem is clearly equivalent to the original problem (31), since combinations
of the constraints adjoined to $f(x)$ do not affect the minimum point or the minimum
value. However, if the multiplier vector were selected equal to $\lambda^*$, the correct
Lagrange multiplier, then the gradient of $l_c(x, \lambda^*)$ would vanish at the solution $x^*$.
This is because $\nabla l_c(x, \lambda^*) = 0$ implies $\nabla f(x) + (\lambda^*)^T \nabla h(x) + c\,h(x)^T \nabla h(x) = 0$,
which is satisfied by $\nabla f(x) + (\lambda^*)^T \nabla h(x) = 0$ and $h(x) = 0$. Thus the augmented
Lagrangian is seen to be an exact penalty function when the proper value of $\lambda^*$ is
used.

A typical step of an augmented Lagrangian method starts with a vector $\lambda_k$.
Then $x_k$ is found as the minimum point of

$$
f(x) + \lambda_k^T h(x) + \tfrac{1}{2} c |h(x)|^2. \tag{33}
$$

Next $\lambda_k$ is updated to $\lambda_{k+1}$. A standard method for the update is

$$
\lambda_{k+1} = \lambda_k + c\,h(x_k).
$$

To motivate the adjustment procedure, consider the constrained problem (32)
with $\lambda = \lambda_k$. The Lagrange multiplier corresponding to this problem is $\lambda^* - \lambda_k$,
where $\lambda^*$ is the Lagrange multiplier of (31). On the other hand since (33) is the
penalty function corresponding to (32), it follows from the results of Section 13.3
that $c\,h(x_k)$ is approximately equal to the Lagrange multiplier of (32). Combining
these two facts, we obtain $c\,h(x_k) \approx \lambda^* - \lambda_k$. Therefore, a good approximation to
the unknown $\lambda^*$ is $\lambda_{k+1} = \lambda_k + c\,h(x_k)$.

Although the main iteration in augmented Lagrangian methods is with respect
to $\lambda$, the penalty parameter $c$ may also be adjusted during the process. As in ordinary
penalty function methods, the sequence of $c$'s is usually preselected; $c$ is either held
fixed, is increased toward a finite value, or tends (slowly) toward infinity. Since in
this method it is not necessary for $c$ to go to infinity, and in fact it may remain
of relatively modest value, the ill-conditioning usually associated with the penalty
function approach is mitigated.

From the viewpoint of duality theory, the augmented Lagrangian is simply the
standard Lagrangian for the problem

$$
\begin{aligned}
\text{minimize}\quad & f(x) + \tfrac{1}{2} c |h(x)|^2 \\
\text{subject to}\quad & h(x) = 0.
\end{aligned}
\tag{34}
$$

This problem is equivalent to the original problem (31), since the addition of the term
$\tfrac{1}{2} c |h(x)|^2$ to the objective does not change the optimal value, the optimum solution
point, nor the Lagrange multiplier. However, whereas the original Lagrangian may
not be convex near the solution, and hence the standard duality method cannot be
applied, the term $\tfrac{1}{2} c |h(x)|^2$ tends to "convexify" the Lagrangian. For sufficiently
large $c$, the Lagrangian will indeed be locally convex. Thus the duality method can
be employed, and the corresponding dual problem can be solved by an iterative
process in $\lambda$. This viewpoint leads to the development of additional multiplier
adjustment processes.

The Penalty Viewpoint


We begin our more detailed analysis of augmented Lagrangian methods by showing 
that if the penalty parameter c is sufficiently large, the augmented Lagrangian has 
a local minimum point near the true optimal point. This follows from the following 
simple lemma. 


Lemma. Let $A$ and $B$ be $n \times n$ symmetric matrices. Suppose that $B$ is
positive semi-definite and that $A$ is positive definite on the subspace $Bx = 0$.
Then there is a $c^*$ such that for all $c \ge c^*$ the matrix $A + cB$ is positive definite.

Proof. Suppose to the contrary that for every $k$ there were an $x_k$ with $|x_k| = 1$ such
that $x_k^T (A + kB) x_k \le 0$. The sequence $\{x_k\}$ must have a convergent subsequence
converging to a limit $\bar{x}$. Now since $x_k^T B x_k \ge 0$, it follows that $\bar{x}^T B \bar{x} = 0$. It also
follows that $\bar{x}^T A \bar{x} \le 0$. However, this contradicts the hypothesis of the lemma.
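
A two-dimensional instance makes the lemma concrete. In the sketch below the matrices are illustrative choices, not taken from the text: A = diag(1, -1) is indefinite but positive definite on the null space of B = diag(0, 1), and A + cB = diag(1, c - 1) becomes positive definite exactly when c exceeds 1, so any $c^* > 1$ serves as the threshold here.

```python
def is_positive_definite_2x2(M):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    return M[0][0] > 0 and M[0][0] * M[1][1] - M[0][1] * M[1][0] > 0

def a_plus_cb(c):
    """Form A + c*B for the illustrative pair described above."""
    A = [[1.0, 0.0], [0.0, -1.0]]   # indefinite, but positive definite on {x : Bx = 0}
    B = [[0.0, 0.0], [0.0, 1.0]]    # positive semidefinite, null space spanned by (1, 0)
    return [[A[i][j] + c * B[i][j] for j in range(2)] for i in range(2)]

for c in (0.0, 0.5, 1.0, 2.0, 10.0):
    print(c, is_positive_definite_2x2(a_plus_cb(c)))
```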


This lemma applies directly to the Hessian of the augmented Lagrangian
evaluated at the optimal solution pair $x^*$, $\lambda^*$. We assume as usual that the second-
order sufficiency conditions for a constrained minimum hold at $x^*$, $\lambda^*$. The Hessian
of the augmented Lagrangian evaluated at the optimal pair $x^*$, $\lambda^*$ is

$$
\begin{aligned}
L_c(x^*, \lambda^*) &= F(x^*) + (\lambda^*)^T H(x^*) + c \nabla h(x^*)^T \nabla h(x^*) \\
&= L(x^*) + c \nabla h(x^*)^T \nabla h(x^*).
\end{aligned}
$$

The first term, the Hessian of the normal Lagrangian, is positive definite on the
subspace $\nabla h(x^*) x = 0$. This corresponds to the matrix $A$ in the lemma. The matrix
$\nabla h(x^*)^T \nabla h(x^*)$ is positive semi-definite and corresponds to $B$ in the lemma. It
follows that there is a $c^*$ such that for all $c > c^*$, $L_c(x^*, \lambda^*)$ is positive definite.
This leads directly to the first basic result concerning augmented Lagrangians.

Proposition 1. Assume that the second-order sufficiency conditions for a local
minimum are satisfied at $x^*$, $\lambda^*$. Then there is a $c^*$ such that for all $c \ge c^*$, the
augmented Lagrangian $l_c(x, \lambda^*)$ has a local minimum point at $x^*$.


By a continuity argument the result of the above proposition can be extended to
a neighborhood around $x^*$, $\lambda^*$. That is, for any $\lambda$ near $\lambda^*$, the augmented Lagrangian
has a unique local minimum point near $x^*$. This correspondence defines a continuous
function. If a value of $\lambda$ can be found such that $h(x(\lambda)) = 0$, then that $\lambda$ must in
fact be $\lambda^*$, since $x(\lambda)$ satisfies the necessary conditions of the original problem.
Therefore, the problem of determining the proper value of $\lambda$ can be viewed as one
of solving the equation $h(x(\lambda)) = 0$. For this purpose the iterative process

$$
\lambda_{k+1} = \lambda_k + c\,h(x(\lambda_k)),
$$


is a method of successive approximation. This process will converge linearly in a
neighborhood around $\lambda^*$, although a rigorous proof is somewhat complex. We shall
give more definite convergence results when we consider the duality viewpoint.

Example 1. Consider the simple quadratic problem studied in Section 13.8

$$
\begin{aligned}
\text{minimize}\quad & 2x^2 + 2xy + y^2 - 2y \\
\text{subject to}\quad & x = 0.
\end{aligned}
$$

The augmented Lagrangian for this problem is

$$
l_c(x, y, \lambda) = 2x^2 + 2xy + y^2 - 2y + \lambda x + \tfrac{1}{2} c x^2.
$$

The minimum of this can be found analytically to be $x = -(2+\lambda)/(2+c)$, $y =
(4+c+\lambda)/(2+c)$. Since $h(x, y) = x$ in this example, it follows that the iterative
process for $\lambda_k$ is

$$
\lambda_{k+1} = \lambda_k - \frac{c(2 + \lambda_k)}{2 + c}
$$

or

$$
\lambda_{k+1} = \frac{2}{2+c} \lambda_k - \frac{2c}{2+c}.
$$

This converges to $\lambda = -2$ for any $c > 0$. The coefficient $2/(2+c)$ governs the rate
of convergence, and clearly, as $c$ is increased the rate improves.
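
The multiplier iteration for this quadratic example can be run directly. The sketch below uses the closed-form minimizer of the augmented Lagrangian given above together with the update $\lambda_{k+1} = \lambda_k + c\,h(x_k)$, where $h(x, y) = x$; the starting multiplier, the value of $c$, and the number of iterations are arbitrary illustrative choices.

```python
def minimize_aug_lagrangian(lam, c):
    """Closed-form minimizer of 2x^2 + 2xy + y^2 - 2y + lam*x + (c/2)*x^2."""
    x = -(2 + lam) / (2 + c)
    y = (4 + c + lam) / (2 + c)
    return x, y

def multiplier_method(lam0, c, iters):
    """Run lam_{k+1} = lam_k + c*h(x_k) with h(x, y) = x and record every multiplier."""
    lam, history = lam0, [lam0]
    for _ in range(iters):
        x, _ = minimize_aug_lagrangian(lam, c)
        lam = lam + c * x
        history.append(lam)
    return history

c = 2.0
hist = multiplier_method(0.0, c, iters=30)
errors = [lam + 2.0 for lam in hist]            # distance from the limit lambda = -2
ratio = errors[2] / errors[1]                   # observed contraction factor
x_final, y_final = minimize_aug_lagrangian(hist[-1], c)
print(hist[-1], ratio, 2 / (2 + c), x_final, y_final)
```

The observed contraction factor matches $2/(2+c)$, and the primal iterates approach the constrained solution $x = 0$, $y = 1$; rerunning with a larger $c$ shows the faster rate described in the example.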


Geometric Interpretation 


The augmented Lagrangian method can be interpreted geometrically in terms of 
the primal function in a manner analogous to that in Sections 13.3 and 13.8 for 
the ordinary quadratic penalty function and the absolute-value penalty function. 
Consider again the primal function $\omega(y)$ defined as

$$
\omega(y) = \min \{ f(x) : h(x) = y \},
$$

where the minimum is understood to be taken locally near $x^*$. We remind the
reader that $\omega(0) = f(x^*)$ and that $\nabla \omega(0)^T = -\lambda^*$. The minimum of the augmented
Lagrangian at step $k$ can be expressed in terms of the primal function as follows:

$$
\begin{aligned}
\min_x\, l_c(x, \lambda_k) &= \min_x \left\{ f(x) + \lambda_k^T h(x) + \tfrac{1}{2} c |h(x)|^2 \right\} \\
&= \min_{x,\,y} \left\{ f(x) + \lambda_k^T y + \tfrac{1}{2} c |y|^2 : h(x) = y \right\} \\
&= \min_y \left\{ \omega(y) + \lambda_k^T y + \tfrac{1}{2} c |y|^2 \right\},
\end{aligned}
\tag{35}
$$

where the minimization with respect to $y$ is to be taken locally near $y = 0$. This
minimization is illustrated geometrically for the case of a single constraint in
Fig. 14.5. The lower curve represents $\omega(y)$, and the upper curve represents $\omega(y) +
\tfrac{1}{2} c |y|^2$. The minimum point $y_k$ of (35) occurs at the point where this upper curve
has slope equal to $-\lambda_k$. It is seen that for $c$ sufficiently large this curve will be
convex at $y = 0$. If $\lambda_k$ is close to $\lambda^*$, it is clear that this minimum point will be
close to 0; it will be exact if $\lambda_k = \lambda^*$.

Fig. 14.5 Primal function and augmented Lagrangian

The process for updating $\lambda_k$ is also illustrated in Fig. 14.5. Note that in general,
if $x_k$ minimizes $l_c(x, \lambda_k)$, then $y_k = h(x_k)$ is the minimum point of $\omega(y) + \lambda_k^T y +
\tfrac{1}{2} c |y|^2$. At that point we have as before

$$
\nabla \omega(y_k)^T + c y_k = -\lambda_k
$$

or equivalently,

$$
\nabla \omega(y_k)^T = -(\lambda_k + c y_k) = -(\lambda_k + c\,h(x_k)).
$$

It follows that for the next multiplier we have

$$
\lambda_{k+1} = \lambda_k + c\,h(x_k) = -\nabla \omega(y_k)^T,
$$

as shown in Fig. 14.5 for the one-dimensional case. In the figure the next point
$y_{k+1}$ is the point where $\omega(y) + \tfrac{1}{2} c |y|^2$ has slope $-\lambda_{k+1}$, which will yield a positive
value of $y_{k+1}$ in this case. It can be seen that if $\lambda_k$ is sufficiently close to $\lambda^*$, then
$\lambda_{k+1}$ will be even closer, and the iterative process will converge.


14.6 THE DUAL VIEWPOINT 


In the method of augmented Lagrangians (the method of multipliers), the primary
iteration is with respect to $\lambda$, and therefore it is most natural to consider the method
from the dual viewpoint. This is in fact the more powerful viewpoint and leads to
improvements in the algorithm.

As we observed earlier, the constrained problem

$$
\begin{aligned}
\text{minimize}\quad & f(x) \\
\text{subject to}\quad & h(x) = 0
\end{aligned}
\tag{36}
$$

is equivalent to the problem

$$
\begin{aligned}
\text{minimize}\quad & f(x) + \tfrac{1}{2} c |h(x)|^2 \\
\text{subject to}\quad & h(x) = 0
\end{aligned}
\tag{37}
$$

in the sense that the solution points, the optimal values, and the Lagrange multipliers
are the same for both problems. However, as spelled out by Proposition 1 of the
previous section, whereas problem (36) may not be locally convex, problem (37) is
locally convex for sufficiently large $c$; specifically, the Hessian of the Lagrangian is
positive definite at the solution pair $x^*$, $\lambda^*$. Thus local duality theory is applicable
to problem (37) for sufficiently large $c$.

To apply the dual method to (37), we define the dual function

$$
\phi(\lambda) = \min \left\{ f(x) + \lambda^T h(x) + \tfrac{1}{2} c |h(x)|^2 \right\} \tag{38}
$$

in a region near $x^*$, $\lambda^*$. If $x(\lambda)$ is the vector minimizing the right-hand side of
(38), then as we have seen in Section 14.2, $h(x(\lambda))$ is the gradient of $\phi$. Thus the
iterative process

$$
\lambda_{k+1} = \lambda_k + c\,h(x(\lambda_k))
$$

used in the basic augmented Lagrangian method is seen to be a steepest ascent
iteration for maximizing the dual function $\phi$. It is a simple form of steepest ascent,
using a constant stepsize $c$.

Although the stepsize $c$ is a good choice (as will become even more evident
later), it is clearly advantageous to apply the algorithmic principles of optimization
developed previously by selecting the stepsize so that the new value of the dual
function satisfies an ascent criterion. This can extend the range of convergence of
the algorithm.

The rate of convergence of the optimal steepest ascent method (where the
steplength is selected to maximize $\phi$ in the gradient direction) is determined by the
eigenvalues of the Hessian of $\phi$. The Hessian of $\phi$ is found from (15) to be the
negative of the matrix

$$
\nabla h(x(\lambda)) \left[ L(x(\lambda), \lambda) + c \nabla h(x(\lambda))^T \nabla h(x(\lambda)) \right]^{-1} \nabla h(x(\lambda))^T. \tag{39}
$$

The eigenvalues of this matrix at the solution point $x^*$, $\lambda^*$ determine the convergence
rate of the method of steepest ascent.
To analyze the eigenvalues we make use of the matrix identity 


$$cB(A + cB^T B)^{-1} B^T = I - (I + cBA^{-1}B^T)^{-1},$$


which is a generalization of the Sherman-Morrison formula. (See Section 10.4.)
It is easily seen from the above identity that the matrices $B(A + cB^T B)^{-1} B^T$ and
$BA^{-1}B^T$ have identical eigenvectors. One way to see this is to multiply both sides
of the identity by $(I + cBA^{-1}B^T)$ on the right to obtain


$$cB(A + cB^T B)^{-1} B^T (I + cBA^{-1}B^T) = cBA^{-1}B^T.$$


Suppose both sides are applied to an eigenvector $e$ of $BA^{-1}B^T$ having eigenvalue
$w$. Then we obtain


$$cB(A + cB^T B)^{-1} B^T (1 + cw)\, e = cw\, e.$$


It follows that $e$ is also an eigenvector of $B(A + cB^T B)^{-1} B^T$, and if $\mu$ is the
corresponding eigenvalue, the relation


$$c\mu (1 + cw) = cw$$


must hold. Therefore, the eigenvalues are related by


$$\mu = \frac{w}{1 + cw}. \tag{40}$$


The above relations apply directly to the Hessian (39) through the associations
$A = L(x^*, \lambda^*)$ and $B = \nabla h(x^*)$. Note that the matrix
$\nabla h(x^*) L(x^*, \lambda^*)^{-1} \nabla h(x^*)^T$, corresponding to $BA^{-1}B^T$ above, is the
Hessian of the dual function of the original problem (36). As shown in Section 14.3
the eigenvalues of this matrix determine the rate of convergence for the ordinary
dual method. Let $w$ and $W$ be the smallest and largest eigenvalues of this matrix.
From (40) it follows that the ratio of smallest to largest eigenvalues of the Hessian
of the dual for the augmented problem is


$$\frac{\dfrac{1}{W} + c}{\dfrac{1}{w} + c}.$$


This shows explicitly how the rate of convergence of the multiplier method depends
on $c$. As $c$ goes to infinity, the ratio of eigenvalues goes to unity, implying arbitrarily
fast convergence.

Other unconstrained optimization techniques may be applied to the 
maximization of the dual function defined by the augmented Lagrangian; conjugate 
gradient methods, Newton’s method, and quasi-Newton methods can all be used. 
The use of Newton’s method requires evaluation of the Hessian matrix (39). For 
some problems this may be feasible, but for others some sort of approximation is 
desirable. One approximation is obtained by noting that for large values of c, the 
Hessian (39) is approximately equal to (1/c)I. Using this value for the Hessian and 
$h(x(\lambda))$ for the gradient, we are led to the iterative scheme


$$\lambda_{k+1} = \lambda_k + c\, h(x(\lambda_k)),$$


which is exactly the simple method of multipliers originally proposed. 

We might summarize the above observations by the following statement relating 
primal and dual convergence rates. If a penalty term is incorporated into a problem, 
the condition number of the primal problem becomes increasingly poor as $c \to \infty$
but the condition number of the dual becomes increasingly good. To apply the dual 
method, however, an unconstrained penalty problem of poor condition number must 
be solved at each step. 
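
The contraction implied by (40) can be checked numerically on a small example. The sketch below is illustrative only: the problem data, the function names, and the closed-form inner minimization are assumptions chosen so that everything fits in a few lines of plain Python. With a single linear constraint the matrix $BA^{-1}B^T$ reduces to the scalar $w = \sum_i a_i^2 / q_i$, and each multiplier update should shrink the error in $\lambda$ by the factor $1/(1 + cw)$.

```python
# Illustrative problem (an assumption, not taken from the text):
#   minimize 1/2*(q1*x1^2 + q2*x2^2)  subject to  h(x) = a1*x1 + a2*x2 - b = 0
Q_DIAG = (1.0, 2.0)
A_ROW = (1.0, 1.0)
B_RHS = 1.0

# Scalar counterpart of B A^{-1} B^T for this single-constraint problem.
W = sum(a * a / q for a, q in zip(A_ROW, Q_DIAG))


def minimize_augmented(lam, c):
    """Return (x, h(x)) for the minimizer of f(x) + lam*h(x) + c/2*h(x)^2."""
    # Setting the gradient to zero gives h = -(W*lam + b) / (1 + c*W).
    h = -(W * lam + B_RHS) / (1.0 + c * W)
    x = [-a * (lam + c * h) / q for a, q in zip(A_ROW, Q_DIAG)]
    return x, h


def method_of_multipliers(c, steps, lam=0.0):
    """Iterate lam <- lam + c*h(x(lam)); return final lam, x, and the error history."""
    lam_star = -B_RHS / W  # exact multiplier for this quadratic problem
    errors = []
    for _ in range(steps):
        errors.append(abs(lam - lam_star))
        _, h = minimize_augmented(lam, c)
        lam = lam + c * h
    x, _ = minimize_augmented(lam, c)
    return lam, x, errors


if __name__ == "__main__":
    for c in (1.0, 10.0, 100.0):
        lam, x, errors = method_of_multipliers(c, steps=6)
        observed = errors[1] / errors[0]
        predicted = 1.0 / (1.0 + c * W)
        print(f"c={c:6.1f}  lam={lam:.8f}  observed={observed:.5f}  predicted={predicted:.5f}")
```

Larger values of `c` give a smaller contraction factor in the dual iteration, which matches the statement that the dual becomes better conditioned as the penalty grows. The sketch hides the cost of the inner minimization, which here is available in closed form.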


Inequality Constraints 


One advantage of augmented Lagrangian methods is that inequality constraints can 
be easily incorporated. Let us consider the problem with inequality constraints: 


$$\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & g(x) \le 0,
\end{array} \tag{41}$$


where g is p-dimensional. We assume that this problem has a well-defined solution 
x*, which is a regular point of the constraints and which satisfies the second- 
order sufficiency conditions for a local minimum as specified in Section 11.8. This 
problem can be written as an equivalent problem with equality constraints: 


$$\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & g_j(x) + z_j^2 = 0, \quad j = 1, 2, \ldots, p.
\end{array} \tag{42}$$


Through this conversion we can hope to simply apply the theory for equality 
constraints to problems with inequalities. 

In order to do so we must insure that (42) satisfies the second-order sufficiency 
conditions of Section 11.5. These conditions will not hold unless we impose a strict 
complementarity assumption that $g_j(x^*) = 0$ implies $\mu_j > 0$ as well as the usual
second-order sufficiency conditions for the original problem (41). (See Exercise 10.)

With these assumptions we define the dual function corresponding to the
augmented Lagrangian method as


$$\phi(\mu) = \min_{z,\, x} \left\{ f(x) + \sum_{j=1}^{p} \left( \mu_j [g_j(x) + z_j^2] + \tfrac{1}{2} c\, |g_j(x) + z_j^2|^2 \right) \right\}.$$


For convenience we define $v_j = z_j^2$ for $j = 1, 2, \ldots, p$. Then the definition of $\phi(\mu)$
becomes


$$\phi(\mu) = \min_{v \ge 0,\, x} \left\{ f(x) + \mu^T [g(x) + v] + \tfrac{1}{2} c\, |g(x) + v|^2 \right\}. \tag{43}$$


The minimization with respect to v in (43) can be carried out analytically, and this 
will lead to a definition of the dual function that only involves minimization with 
respect to $x$. The variable $v_j$ enters the objective of the dual function only through
the expression


$$P_j = \mu_j [g_j(x) + v_j] + \tfrac{1}{2} c\, [g_j(x) + v_j]^2. \tag{44}$$


It is this expression that we must minimize with respect to $v_j \ge 0$. This is easily
accomplished by differentiation: If $v_j > 0$, the derivative must vanish; if $v_j = 0$,
the derivative must be nonnegative. The derivative is zero at $v_j = -g_j(x) - \mu_j/c$.
Thus we obtain the solution


$$v_j = \begin{cases} -g_j(x) - \dfrac{\mu_j}{c} & \text{if } -g_j(x) - \dfrac{\mu_j}{c} \ge 0 \\[1ex] 0 & \text{otherwise} \end{cases}$$


or equivalently,


$$v_j = \max \left[ 0,\; -g_j(x) - \frac{\mu_j}{c} \right]. \tag{45}$$


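We now substitute this into (44) in order to obtain an explicit expression for the
minimum of $P_j$.
For $v_j = 0$, we have


$$P_j = \frac{1}{2c} \left\{ 2 \mu_j c\, g_j(x) + c^2 g_j(x)^2 \right\} = \frac{1}{2c} \left\{ [\mu_j + c\, g_j(x)]^2 - \mu_j^2 \right\}.$$

For $v_j = -g_j(x) - \mu_j/c$ we have

$$P_j = -\mu_j^2 / 2c.$$


These can be combined into the formula


$$P_j = \frac{1}{2c} \left\{ [\max(0,\, \mu_j + c\, g_j(x))]^2 - \mu_j^2 \right\}.$$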

Fig. 14.6 Penalty function for inequality problem


In view of the above, let us define the function of two scalar arguments $t$ and $\mu$:


$$P_c(t, \mu) = \frac{1}{2c} \left\{ [\max(0,\, \mu + ct)]^2 - \mu^2 \right\}. \tag{46}$$


For a fixed $\mu > 0$, this function is shown in Fig. 14.6. Note that it is a smooth
function with derivative with respect to $t$ equal to $\mu$ at $t = 0$.
The dual function for the inequality problem can now be written as


$$\phi(\mu) = \min_{x} \left\{ f(x) + \sum_{j=1}^{p} P_c(g_j(x), \mu_j) \right\}. \tag{47}$$


Thus inequality problems can be treated by adjoining to f(x) a special penalty 
function (that depends on $\mu$). The Lagrange multiplier can then be adjusted to
maximize $\phi$, just as in the case of equality constraints.
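
The two routes to the penalty term can be compared numerically. The following is a minimal sketch, with function names and test values chosen here purely for illustration: it evaluates the closed form (46), evaluates expression (44) at the minimizing slack value (45), and estimates the slope of (46) at $t = 0$ by a central difference, which should be close to $\mu$.

```python
def penalty(t, mu, c):
    """Closed-form penalty P_c(t, mu) of (46)."""
    return (max(0.0, mu + c * t) ** 2 - mu ** 2) / (2.0 * c)


def penalty_via_slack(t, mu, c):
    """Expression (44) evaluated at the minimizing slack value v given by (45)."""
    v = max(0.0, -t - mu / c)
    return mu * (t + v) + 0.5 * c * (t + v) ** 2


def slope_at_zero(mu, c, step=1e-6):
    """Central-difference estimate of the derivative of P_c with respect to t at t = 0."""
    return (penalty(step, mu, c) - penalty(-step, mu, c)) / (2.0 * step)


if __name__ == "__main__":
    mu, c = 1.5, 4.0  # arbitrary illustrative values
    for t in (-2.0, -0.375, -0.1, 0.0, 0.3):
        print(t, penalty(t, mu, c), penalty_via_slack(t, mu, c))
    print("slope at t = 0:", slope_at_zero(mu, c))
```

For $t \le -\mu/c$ both functions return the constant $-\mu^2/2c$, which is the flat part of the curve in Fig. 14.6.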


14.7 CUTTING PLANE METHODS


Cutting plane methods are applied to problems having the general form 


$$\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & x \in S,
\end{array} \tag{48}$$


where $S \subset E^n$ is a closed convex set. Problems that involve minimization of a
convex function over a convex set, such as the problem


$$\begin{array}{ll}
\text{minimize} & f(y) \\
\text{subject to} & y \in R,
\end{array} \tag{49}$$

where $R \subset E^{n-1}$ is a convex set and $f$ is a convex function, can be easily converted
to the form (48) by writing (49) equivalently as


$$\begin{array}{ll}
\text{minimize} & r \\
\text{subject to} & f(y) - r \le 0 \\
& y \in R,
\end{array} \tag{50}$$


which, with $x = (r, y)$, is a special case of (48).


General Form of Algorithm 


The general form of a cutting-plane algorithm for problem (48) is as follows: 
Given a polytope $P_k \supset S$


Step 1. Minimize $c^T x$ over $P_k$, obtaining a point $x_k$ in $P_k$. If $x_k \in S$, stop; $x_k$ is
optimal. Otherwise,


Step 2. Find a hyperplane $H_k$ separating the point $x_k$ from $S$, that is, find $a_k \in E^n$,
$b_k \in E^1$ such that $S \subset \{x : a_k^T x \le b_k\}$, $x_k \in \{x : a_k^T x > b_k\}$. Update $P_k$ to
obtain $P_{k+1}$, including as a constraint $a_k^T x \le b_k$.


The process is illustrated in Fig. 14.7. 

Specific algorithms differ mainly in the manner in which the hyperplane that 
separates the current point $x_k$ from the constraint set $S$ is selected. This selection is,
of course, the most important aspect of the algorithm, since it is the deepness of the 
cut associated with the separating hyperplane, the distance of the hyperplane from 
the current point, that governs how much improvement there is in the approximation 
to the constraint set, and hence how fast the method converges. 


Fig. 14.7 Cutting plane method 

Specific algorithms also differ somewhat with respect to the manner by which
the polytope is updated once the new hyperplane is determined. The most straightforward
procedure is to simply adjoin the linear inequality associated with that
hyperplane to the ones determined previously. This yields the best possible updated 
approximation to the constraint set but tends to produce, after a large number of 
iterations, an unwieldy number of inequalities expressing the approximation. Thus, 
in some algorithms, older inequalities that are not binding at the current point are 
discarded from further consideration. 


Duality 


The general cutting plane algorithm can be regarded as an extended application of 
duality in linear programming, and although this viewpoint does not particularly aid 
in the analysis of the method, it reveals the basic interconnection between cutting 
plane and dual methods. The foundation of this viewpoint is the fact that S can be 
written as the intersection of all the half-spaces that contain it; thus 


$$S = \{x : a_i^T x \le b_i,\; i \in I\},$$


where $I$ is an (infinite) index set corresponding to all half-spaces containing $S$.
With S viewed in this way problem (48) can be thought of as an (infinite) linear 
programming problem. 

Corresponding to this linear program there is (at least formally) the dual 
problem 


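$$\begin{array}{ll}
\text{maximize} & \sum_{i \in I} \lambda_i b_i \\
\text{subject to} & \sum_{i \in I} \lambda_i a_i = c \\
& \lambda_i \ge 0, \quad i \in I.
\end{array} \tag{51}$$

Selecting a finite subset of $I$, say $\bar{I}$, and forming

$$P = \{x : a_i^T x \le b_i,\; i \in \bar{I}\}$$


gives a polytope that contains $S$. Minimizing $c^T x$ over this polytope yields a point
and a corresponding subset of active constraints $I_A$. The dual problem with the
additional restriction $\lambda_i = 0$ for $i \notin I_A$ will then have a feasible solution, but this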
solution will in general not be optimal. Thus, a solution to a polytope problem 
corresponds to a feasible but non-optimal solution to the dual. For this reason the 
cutting plane method can be regarded as working toward optimality of the (infinite 
dimensional) dual. 

14.8 KELLEY’S CONVEX CUTTING PLANE ALGORITHM


The convex cutting plane method was developed to solve convex programming 
problems of the form 


$$\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & g_i(x) \le 0, \quad i = 1, 2, \ldots, p,
\end{array} \tag{52}$$


where $x \in E^n$ and $f$ and the $g_i$'s are differentiable convex functions. As indicated
in the last section, it is sufficient to consider the case where the objective function
is linear; thus, we consider the problem


$$\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & g(x) \le 0,
\end{array} \tag{53}$$

where $x \in E^n$ and $g(x) \in E^p$ is convex and differentiable.
For $g$ convex and differentiable we have the fundamental inequality

$$g(x) \ge g(w) + \nabla g(w)(x - w) \tag{54}$$


for any $x$, $w$. We use this inequality to determine the separating hyperplane. Specifically,
the algorithm is as follows:

Let $S = \{x : g(x) \le 0\}$ and let $P$ be an initial polytope containing $S$ and such
that $c^T x$ is bounded on $P$. Then


Step 1. Minimize $c^T x$ over $P$ obtaining the point $x = w$. If $g(w) \le 0$, stop; $w$ is
an optimal solution. Otherwise,


Step 2. Let $i$ be an index maximizing $g_i(w)$. Clearly $g_i(w) > 0$. Define the new
approximating polytope to be the old one intersected with the half-space


$$\{x : g_i(w) + \nabla g_i(w)(x - w) \le 0\}. \tag{55}$$

Return to Step 1.


The set defined by (55) is actually a half-space if $\nabla g_i(w) \ne 0$. However,
$\nabla g_i(w) = 0$ would imply that $w$ minimizes $g_i$, which is impossible if $S$ is nonempty.
Furthermore, the half-space given by (55) contains $S$, since if $g(x) \le 0$ then by (54)
$g_i(w) + \nabla g_i(w)(x - w) \le g_i(x) \le 0$. The half-space does not contain the point $w$
since $g_i(w) > 0$. This method for selecting the separating hyperplane is illustrated
in Fig. 14.8 for the one-dimensional case. Note that in one dimension, the procedure 
reduces to Newton’s method. 

Fig. 14.8 Convex cutting plane


Calculation of the separating hyperplane is exceedingly simple in this algorithm, 
and hence the method really amounts to the solution of a series of linear 
programming problems. It should be noted that this algorithm, valid for any convex 
programming problem, does not involve any line searches. In that respect it is also 
similar to Newton’s method applied to a convex function. 
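
The one-dimensional case makes the connection with Newton's method concrete. The sketch below is an illustration under stated assumptions rather than a general implementation: the "polytope" is an interval `[lo, hi]`, so the linear program in Step 1 is solved by picking an endpoint, and the function name, the tolerance, and the sample constraint $g(x) = x^2 - 2$ are all choices made here for the example.

```python
import math


def kelley_1d(c, g, dg, lo, hi, tol=1e-12, max_iter=100):
    """Convex cutting plane method on [lo, hi] for: minimize c*x subject to g(x) <= 0."""
    iterates = []
    w = hi
    for _ in range(max_iter):
        # Step 1: a linear objective on an interval is minimized at an endpoint.
        w = lo if c > 0 else hi
        iterates.append(w)
        if g(w) <= tol:
            return w, iterates
        # Step 2: add the cut g(w) + dg(w)*(x - w) <= 0, as in (55).
        slope = dg(w)
        if slope == 0.0:
            raise ValueError("zero gradient at an infeasible point: S is empty")
        bound = w - g(w) / slope  # identical to a Newton step for g(x) = 0
        if slope > 0:
            hi = min(hi, bound)
        else:
            lo = max(lo, bound)
    return w, iterates


if __name__ == "__main__":
    # Maximize x (minimize -x) subject to x^2 - 2 <= 0, starting from [0, 3].
    w, iterates = kelley_1d(-1.0, lambda x: x * x - 2.0, lambda x: 2.0 * x, 0.0, 3.0)
    print(iterates)
    print(w, math.sqrt(2.0))
```

Each new endpoint is the Newton iterate for $g(x) = 0$ started from the previous endpoint, so in this example the iterates decrease toward $\sqrt{2}$ from above.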


Convergence 


Under fairly mild assumptions on the convex function, the convex cutting plane
method is globally convergent. It is possible to apply the general convergence
theorem to prove this, but somewhat easier, in this case, to prove it
directly.


Theorem. Let the convex functions $g_i$, $i = 1, 2, \ldots, p$ be continuously differentiable,
and suppose the convex cutting plane algorithm generates the sequence
of points $\{w_k\}$. Any limit point of this sequence is a solution to problem (53).


Proof. Suppose $\{w_k\}$, $k \in K$ is a subsequence of $\{w_k\}$ converging to $w$. By
taking a further subsequence of this, if necessary, we may assume that the index $i$
corresponding to Step 2 of the algorithm is fixed throughout the subsequence. Now
if $k \in K$, $k' \in K$ and $k' > k$, then we must have


$$g_i(w_k) + \nabla g_i(w_k)(w_{k'} - w_k) \le 0,$$

which implies that

$$g_i(w_k) \le |\nabla g_i(w_k)|\, |w_{k'} - w_k|. \tag{56}$$


Since $|\nabla g_i(w_k)|$ is bounded with respect to $k \in K$, the right-hand side of (56) goes
to zero as $k$ and $k'$ go to infinity. The left-hand side goes to $g_i(w)$. Thus $g_i(w) \le 0$
and we see that $w$ is feasible for problem (53).

If $f^*$ is the optimal value of problem (53), we have $c^T w_k \le f^*$ for each $k$
since $w_k$ is obtained by minimizing over a set containing $S$. Thus, by continuity,
$c^T w \le f^*$ and hence $w$ is an optimal solution. ∎


As with most algorithms based on linear programming concepts, the rate of 
convergence of cutting plane algorithms has not yet been satisfactorily analyzed. 
Preliminary research shows that these algorithms converge arithmetically, that is,
if $x^*$ is optimal, then $|x_k - x^*|^2 \le c/k$ for some constant $c$. This is an exceedingly
poor type of convergence. This estimate, however, may not be the best possible and 
indeed there are indications that the convergence is actually geometric but with a 
ratio that goes to unity as the dimension of the problem increases. 


14.9 MODIFICATIONS 


In this section we describe the supporting hyperplane algorithm (an alternative 
method for determining a cutting plane) and examine the possibility of dropping 
from consideration some old hyperplanes so that the linear programs do not grow 
too large. 


The Supporting Hyperplane Algorithm 


The convexity requirements are less severe for this algorithm. It is applicable to 
problems of the form 


$$\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & g(x) \le 0,
\end{array}$$


where $x \in E^n$, $g(x) \in E^p$, the $g_i$'s are continuously differentiable, and the constraint
region $S$ defined by the inequalities is convex. Note that convexity of the functions
themselves is not required. We also assume the existence of a point interior to the
constraint region, that is, we assume the existence of a point $y$ such that $g(y) < 0$,
and we assume that on the constraint boundary $g_i(x) = 0$ implies $\nabla g_i(x) \ne 0$. The
algorithm is as follows:

Start with an initial polytope $P$ containing $S$ and such that $c^T x$ is bounded
below on $S$. Then


Step 1. Determine $w = x$ to minimize $c^T x$ over $P$. If $w \in S$, stop. Otherwise,


Step 2. Find the point $u$ on the line joining $y$ and $w$ that lies on the boundary
of $S$. Let $i$ be an index for which $g_i(u) = 0$ and define the half-space
$H = \{x : \nabla g_i(u)(x - u) \le 0\}$. Update $P$ by intersecting with $H$. Return to Step 1.


The algorithm is illustrated in Fig. 14.9.
The price paid for the generality of this method over the convex cutting plane
method is that an interpolation along the line joining $y$ and $w$ must be executed to
find the point $u$. This is analogous to the line search for a minimum point required
by most programming algorithms.


Fig. 14.9 Supporting hyperplane algorithm
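
Step 2 of the supporting hyperplane algorithm can be sketched as follows. This is a hedged illustration, not the text's prescription: it assumes a single constraint function `g`, uses plain bisection as the interpolation procedure, and all names and the unit-disk example are introduced here for demonstration.

```python
def boundary_point(g, y, w, tol=1e-12):
    """Bisect along the segment from an interior point y (g(y) < 0) to w (g(w) > 0)."""
    lo, hi = 0.0, 1.0  # u(t) = y + t*(w - y); u(lo) is feasible, u(hi) is not
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        u = [yi + mid * (wi - yi) for yi, wi in zip(y, w)]
        if g(u) <= 0.0:
            lo = mid
        else:
            hi = mid
    return [yi + lo * (wi - yi) for yi, wi in zip(y, w)]


def supporting_cut(grad_g, u):
    """Return (a, b) for the half-space H = {x : a.x <= b}, with a = grad g(u), b = a.u."""
    a = grad_g(u)
    b = sum(ai * ui for ai, ui in zip(a, u))
    return a, b


if __name__ == "__main__":
    # Illustrative constraint region: the unit disk, g(x) = x1^2 + x2^2 - 1 <= 0.
    g = lambda x: x[0] ** 2 + x[1] ** 2 - 1.0
    grad_g = lambda x: [2.0 * x[0], 2.0 * x[1]]
    y, w = [0.0, 0.0], [2.0, 1.0]
    u = boundary_point(g, y, w)
    a, b = supporting_cut(grad_g, u)
    print("boundary point:", u)
    print("cut: a =", a, " b =", b)
```

The resulting half-space contains the interior point `y` and excludes the infeasible point `w`, so adjoining it to the polytope removes the current linear-programming solution.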


Dropping Nonbinding Constraints 


In all cutting plane algorithms nonbinding constraints can be dropped from the
approximating set of linear inequalities so as to keep the complexity of the approximation
manageable. Indeed, since $n$ linearly independent hyperplanes determine
a single point in $E^n$, the algorithm can be arranged, by discarding the nonbinding
constraints at the end of each step, so that the polytope consists of exactly $n$ linear
inequalities at every stage.

Global convergence is not destroyed by this process, since the sequence of 
objective values will still be monotonically increasing. It is not known, however, 
what effect this has on the speed of convergence. 


14.10 EXERCISES 


1. (Linear programming) Use the global duality theorem to find the dual of the linear
program

$$\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax = b \\
& x \ge 0.
\end{array}$$

Note that some of the regularity conditions may not be necessary for the linear case.


2. (Double dual) Show that for a convex programming problem with a solution, the
dual of the dual is in some sense the original problem.


3. (Non-convex?) Consider the problem

$$\begin{array}{ll}
\text{minimize} & xy \\
\text{subject to} & x + y - 4 \ge 0 \\
& 1 \le x \le 5, \quad 1 \le y \le 5.
\end{array}$$

Show that although the objective function is not convex, the primal function is convex.
Find the optimal value and the Lagrange multiplier.


4. Find the global maximum of the dual function of Example 1, Section 14.2.


5. Show that the function $\phi$ defined for $\lambda$, $\mu$ ($\mu \ge 0$) by
$\phi(\lambda, \mu) = \min_x [f(x) + \lambda^T h(x) + \mu^T g(x)]$ is concave over any convex
region where it is finite.


6. Prove that the dual canonical rate of convergence is not affected by a change of variables
in $x$.


7. Corresponding to the dual function (23):

a) Find its gradient.
b) Find its Hessian.
c) Verify that it has a local maximum at $\lambda^*$, $\mu^*$.


8. Find the Hessian of the dual function for a separable problem.


9. Find an explicit formula for the dual function for the entropy problem (Example 3,
Section 11.4).


10. Consider the problems

$$\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & g_j(x) \le 0, \quad j = 1, 2, \ldots, p,
\end{array} \tag{57}$$

and

$$\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & g_j(x) + z_j^2 = 0, \quad j = 1, 2, \ldots, p.
\end{array} \tag{58}$$

a) Let $x^*$, $\mu_1^*, \mu_2^*, \ldots, \mu_p^*$ be a point and set of Lagrange multipliers that satisfy the
first-order necessary conditions for (57). For $x^*$, $\mu^*$, write the second-order sufficiency
conditions for (58).

b) Show that in general they are not satisfied unless, in addition to satisfying the
sufficiency conditions of Section 11.8, $g_j(x^*) = 0$ implies $\mu_j^* > 0$.


11. Establish global convergence for the supporting hyperplane algorithm.


12. Establish global convergence for an imperfect version of the supporting hyperplane
algorithm that in interpolating to find the boundary point $u$ actually finds a point
somewhere on the segment joining $u$ and $\tfrac{1}{2}u + \tfrac{1}{2}w$ and establishes a hyperplane there.


13. Prove that the convex cutting plane method is still globally convergent if it is modified by
discarding from the definition of the polytope at each stage hyperplanes corresponding
to inactive linear inequalities.

Chapter 15 PRIMAL-DUAL METHODS


This chapter discusses methods that work simultaneously with primal and dual 
variables, in essence seeking to satisfy the first-order necessary conditions for 
optimality. The methods employ many of the concepts used in earlier chapters, 
including those related to active set methods, various first and second order methods, 
penalty methods, and barrier methods. Indeed, a study of this chapter is in a sense 
a review and extension of what has been presented earlier. 

The first several sections of the chapter discuss methods for solving the standard
nonlinear programming structure that has been treated in Parts II and III of the
text. These sections provide alternatives to the methods discussed earlier.

Section 15.9, however, discusses a completely different form of problem,
termed semidefinite programming, which evolved from linear programming. These
problems are characterized by inequalities defined by positive-semidefiniteness of
matrices. In other words, rather than a restriction of the form $x \ge 0$ for a vector
$x$, the restriction is of the form $A \succeq 0$ where $A$ is a symmetric matrix and $\succeq 0$
denotes positive semidefiniteness. Such problems are of great practical importance.
The principal solution methods for semidefinite problems are generalizations of the
interior point methods for linear programming.


15.1 THE STANDARD PROBLEM 


Consider again the standard nonlinear program 


$$\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & h(x) = 0 \\
& g(x) \le 0.
\end{array} \tag{1}$$


The first-order necessary conditions for optimality are, as we know,


$$\begin{aligned}
\nabla f(x) + \lambda^T \nabla h(x) + \mu^T \nabla g(x) &= 0 \\
h(x) &= 0 \\
g(x) &\le 0 \\
\mu^T g(x) &= 0.
\end{aligned} \tag{2}$$


The last requirement is the complementary slackness condition. If it is known which 
of the inequality constraints is active at the solution, these active constraints can be 
rolled into the equality constraints h(x) = 0, and the inactive inequalities along with 
the complementary slackness condition dropped, to obtain a problem with equality 
constraints only. This indeed is the structure of the problem near the solution. 

If in this structure the vector $x$ is $n$-dimensional and $h$ is $m$-dimensional, then
$\lambda$ will also be $m$-dimensional. The system (2) will, in this reduced form, consist of
$n + m$ equations and $n + m$ unknowns, which is an indication that the system may
be well defined, and hence that there is a solution for the pair $(x, \lambda)$. In essence,
primal-dual methods amount to solving this system of equations, and use additional
strategies to account for inequality constraints.

In view of the above observation it is natural to consider whether the
system of necessary conditions is in fact well conditioned, possessing a unique
solution $(x, \lambda)$. We investigate this question by considering a linearized version of
the conditions.

A useful and somewhat more general approach is to consider the
quadratic program


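$$\begin{array}{ll}
\text{minimize} & \tfrac{1}{2} x^T Q x + c^T x \\
\text{subject to} & Ax = b,
\end{array} \tag{3}$$


where $x$ is $n$-dimensional and $b$ is $m$-dimensional.
The first-order conditions for this problem are


$$\begin{aligned}
Qx + A^T \lambda + c &= 0 \\
Ax - b &= 0.
\end{aligned} \tag{4}$$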


These correspond to the necessary conditions (2) for equality constraints only. The 
following proposition gives conditions under which the system is nonsingular. 


Proposition. Let $Q$ and $A$ be $n \times n$ and $m \times n$ matrices, respectively. Suppose
that $A$ has rank $m$ and that $Q$ is positive definite on the subspace $M = \{x : Ax = 0\}$.
Then the matrix

$$\begin{bmatrix} Q & A^T \\ A & 0 \end{bmatrix} \tag{5}$$

is nonsingular.


Proof. Suppose $(x, y) \in E^{n+m}$ is such that


$$\begin{aligned}
Qx + A^T y &= 0 \\
Ax &= 0.
\end{aligned} \tag{6}$$

Multiplication of the first equation by $x^T$ yields

$$x^T Q x + x^T A^T y = 0,$$


and substitution of $Ax = 0$ yields $x^T Q x = 0$. However, clearly $x \in M$, and thus
the hypothesis on $Q$ together with $x^T Q x = 0$ implies that $x = 0$. It then follows
from the first equation that $A^T y = 0$. The full-rank condition on $A$ then implies that
$y = 0$. Thus the only solution to (6) is $x = 0$, $y = 0$. ∎


If, as is often the case, the matrix $Q$ is actually positive definite (over the whole
space), then an explicit formula for the solution of the system can be easily derived
as follows: From the first equation in (4) we have


$$x = -Q^{-1} A^T \lambda - Q^{-1} c.$$

Substitution of this into the second equation then yields

$$-A Q^{-1} A^T \lambda - A Q^{-1} c - b = 0,$$

from which we immediately obtain

$$\lambda = -(A Q^{-1} A^T)^{-1} [A Q^{-1} c + b] \tag{7}$$

and

$$\begin{aligned}
x &= Q^{-1} A^T (A Q^{-1} A^T)^{-1} [A Q^{-1} c + b] - Q^{-1} c \\
  &= -Q^{-1} [I - A^T (A Q^{-1} A^T)^{-1} A Q^{-1}]\, c + Q^{-1} A^T (A Q^{-1} A^T)^{-1} b.
\end{aligned} \tag{8}$$
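
Formulas (7) and (8) are easy to exercise on a tiny instance. The sketch below makes simplifying assumptions that are not in the text: $Q$ is diagonal and positive definite and $A$ has a single row, so that $AQ^{-1}A^T$ is a scalar and no matrix library is needed. The function name and the sample data are hypothetical.

```python
def solve_equality_qp(q_diag, c, a, b):
    """Solve: minimize 1/2 x^T Q x + c^T x  subject to  a^T x = b, with Q = diag(q_diag) > 0.

    Returns (x, lam) following (7) and the first line of the derivation of (8),
    specialized to a single constraint row.
    """
    # A Q^{-1} A^T and A Q^{-1} c are scalars because A has one row.
    aqa = sum(ai * ai / qi for ai, qi in zip(a, q_diag))
    aqc = sum(ai * ci / qi for ai, ci, qi in zip(a, c, q_diag))
    lam = -(aqc + b) / aqa  # formula (7)
    # x = -Q^{-1} (A^T lam + c), the first equation of (4) solved for x
    x = [-(ai * lam + ci) / qi for ai, ci, qi in zip(a, c, q_diag)]
    return x, lam


if __name__ == "__main__":
    q_diag, c, a, b = (2.0, 4.0), (-1.0, 1.0), (1.0, 1.0), 1.0
    x, lam = solve_equality_qp(q_diag, c, a, b)
    # Residuals of the first-order conditions (4); both should vanish.
    stationarity = [qi * xi + ai * lam + ci for qi, xi, ai, ci in zip(q_diag, x, a, c)]
    feasibility = sum(ai * xi for ai, xi in zip(a, x)) - b
    print("x =", x, " lam =", lam)
    print("residuals:", stationarity, feasibility)
```

For the sample data the result is $x = (1, 0)$ and $\lambda = -1$, and both residuals of (4) are zero.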


15.2 STRATEGIES 


There are some general strategies that guide the development of the primal—dual 
methods of this chapter. 


1. Descent measures. A fundamental concept that we have frequently used is 
that of assuring that progress is made at each step of an iterative algorithm. 
It is this that is used to guarantee global convergence. In primal methods this 
measure of descent is the objective function. Even the simplex method of linear 
programming is founded on this idea of making progress with respect to the 
objective function. For primal minimization methods, one typically arranges that 
the objective function decreases at each step. 
The objective function is not the only possible way to measure progress. We
have, for example, when minimizing a function $f$, considered the quantity
$(1/2)|\nabla f(x)|^2$, seeking to monotonically reduce it to zero.


In general, a function used to measure progress is termed a merit function.
Typically, it is defined so as to decrease as progress is made toward the solution
of a minimization problem, but the sign may be reversed in some definitions.
For primal-dual methods, the merit function may depend on both $x$ and $\lambda$.


One especially useful merit function for equality constrained problems is

$$m(x, \lambda) = \tfrac{1}{2} |\nabla f(x) + \lambda^T \nabla h(x)|^2 + \tfrac{1}{2} |h(x)|^2.$$


It is examined in the next section.


We shall examine other merit functions later in the chapter. With interior point
methods or semidefinite programming, we shall use a potential function that
serves as a merit function.


2. Active Set Methods. Inequality constraints can be treated using active set
methods that treat the active constraints as equality constraints, at least for
the current iteration. However, in primal-dual methods, both $x$ and $\lambda$ are
changed. We shall consider variations of steepest descent, conjugate directions,
and Newton’s method where movement is made in the $(x, \lambda)$ space.


3. Penalty Functions. In some primal—dual methods, a penalty function can serve 
as a merit function, even though the penalty function depends only on x. This 
is particularly attractive for recursive quadratic programming methods where a 
quadratic program is solved at each stage to determine the direction of change 
in the pair $(x, \lambda)$.


4. Interior (Barrier) Methods. Barrier methods lead to methods that move within 
the relative interior of the inequality constraints. This approach leads to the 
concept of the primal—dual central path. These methods are used for semidefinite 
programming since these problems are characterized as possessing a special 
form of inequality constraint. 


15.3 A SIMPLE MERIT FUNCTION 


It is very natural, when considering the system of necessary conditions (2), to form 
the function 


$$m(x, \lambda) = \tfrac{1}{2} |\nabla f(x) + \lambda^T \nabla h(x)|^2 + \tfrac{1}{2} |h(x)|^2 \tag{9}$$


and use it as a measure of how close a point $(x, \lambda)$ is to a solution.

It must be noted, however, that the function $m(x, \lambda)$ is not always well-behaved;
it may have local minima, and these are of no value in a search for a solution. The
following theorem gives the conditions under which the function $m(x, \lambda)$ can serve
as a well-behaved merit function. Basically, the main requirement is that the Hessian
of the Lagrangian be positive definite. As usual, we define $l(x, \lambda) = f(x) + \lambda^T h(x)$.


Theorem. Let $f$ and $h$ be twice continuously differentiable functions on $E^n$ of
dimension 1 and $m$, respectively. Suppose that $x^*$ and $\lambda^*$ satisfy the first-order
necessary conditions for a local minimum of
$m(x, \lambda) = \tfrac{1}{2} |\nabla f(x) + \lambda^T \nabla h(x)|^2 + \tfrac{1}{2} |h(x)|^2$
with respect to $x$ and $\lambda$. Suppose also that at $x^*$, $\lambda^*$, (i) the rank
of $\nabla h(x^*)$ is $m$ and (ii) the Hessian matrix $L(x^*, \lambda^*) = F(x^*) + \lambda^{*T} H(x^*)$ is
positive definite. Then, $x^*$, $\lambda^*$ is a (possibly nonunique) global minimum point
of $m(x, \lambda)$, with value $m(x^*, \lambda^*) = 0$.


Proof. Since $x^*$, $\lambda^*$ satisfies the first-order conditions for a local minimum point
of $m(x, \lambda)$, we have


$$[\nabla f(x^*) + \lambda^{*T} \nabla h(x^*)]\, L(x^*, \lambda^*) + h(x^*)^T \nabla h(x^*) = 0 \tag{10}$$

$$[\nabla f(x^*) + \lambda^{*T} \nabla h(x^*)]\, \nabla h(x^*)^T = 0. \tag{11}$$


Multiplying (10) on the right by $[\nabla f(x^*) + \lambda^{*T} \nabla h(x^*)]^T$ and using (11) we obtain*

$$\nabla l(x^*, \lambda^*)\, L(x^*, \lambda^*)\, \nabla l(x^*, \lambda^*)^T = 0.$$


Since $L(x^*, \lambda^*)$ is positive definite, this implies that $\nabla l(x^*, \lambda^*) = 0$. Using this in
(10), we find that $h(x^*)^T \nabla h(x^*) = 0$, which, since $\nabla h(x^*)$ is of rank $m$, implies
that $h(x^*) = 0$. ∎


The requirement that the Hessian of the Lagrangian $L(x^*, \lambda^*)$ be positive
definite at a stationary point of the merit function $m$ is actually not too restrictive.
This condition will be satisfied in the case of a convex programming problem where
$f$ is strictly convex and $h$ is linear. Furthermore, even in nonconvex problems one
can often arrange for this condition to hold, at least near a solution to the original
constrained minimization problem. If it is assumed that the second-order sufficiency
conditions for a constrained minimum hold at $x^*$, $\lambda^*$, then $L(x^*, \lambda^*)$ is positive
definite on the subspace that defines the tangent to the constraints; that is, on the
subspace defined by $\nabla h(x^*) x = 0$. Now if the original problem is modified with a
penalty term to the problem


$$\begin{array}{ll}
\text{minimize} & f(x) + \tfrac{1}{2} c\, |h(x)|^2 \\
\text{subject to} & h(x) = 0,
\end{array} \tag{12}$$


the solution point $x^*$ will be unchanged. However, as discussed in Chapter 14,
the Hessian of the Lagrangian of this new problem (12) at the solution point is
$L(x^*, \lambda^*) + c\, \nabla h(x^*)^T \nabla h(x^*)$. For sufficiently large $c$, this matrix will be positive
definite. Thus a problem can be “convexified” (at least locally) before the merit 
function method is employed. 

An extension to problems with inequality constraints can be defined by partitioning
the constraints into the two groups active and inactive. However, at this
point the simple merit function for problems with equality constraints is adequate
for the purpose of illustrating the general idea.


* Unless explicitly indicated to the contrary, the notation $\nabla l(x, \lambda)$ refers to the gradient of
$l$ with respect to $x$, that is, $\nabla_x l(x, \lambda)$.

15.4 BASIC PRIMAL-DUAL METHODS


Many primal—dual methods are patterned after some of the methods used in earlier 
chapters, except of course that the emphasis is on equation solving rather than 
explicit optimization. 


First-Order Method 


We consider first a simple straightforward approach, which in a sense parallels 
the idea of steepest descent in that it uses only a first-order approximation to the 
primal—dual equations. It is defined by 


$$\begin{aligned}
x_{k+1} &= x_k - \alpha_k \nabla l(x_k, \lambda_k)^T \\
\lambda_{k+1} &= \lambda_k + \alpha_k h(x_k),
\end{aligned} \tag{13}$$


where $\alpha_k$ is not yet determined. This is based on the error in satisfying (2). Assume
that the Hessian of the Lagrangian $L(x, \lambda)$ is positive definite in some compact
region of interest, and consider the simple merit function


$$m(x, \lambda) = \tfrac{1}{2} |\nabla l(x, \lambda)|^2 + \tfrac{1}{2} |h(x)|^2 \tag{14}$$


discussed above. We would like to determine whether the direction of change in
(13) is a descent direction with respect to this merit function. The gradient of the
merit function has components corresponding to $x$ and $\lambda$ of


$$\begin{aligned}
&\nabla l(x, \lambda)\, L(x, \lambda) + h(x)^T \nabla h(x) \\
&\nabla l(x, \lambda)\, \nabla h(x)^T.
\end{aligned} \tag{15}$$

Thus the inner product of this gradient with the direction vector having components
$-\nabla l(x, \lambda)^T$, $h(x)$ is


$$\begin{aligned}
&-\nabla l(x, \lambda) L(x, \lambda) \nabla l(x, \lambda)^T - h(x)^T \nabla h(x) \nabla l(x, \lambda)^T + \nabla l(x, \lambda) \nabla h(x)^T h(x) \\
&\quad = -\nabla l(x, \lambda) L(x, \lambda) \nabla l(x, \lambda)^T \le 0.
\end{aligned}$$


This shows that the search direction is in fact a descent direction for the merit
function, unless $\nabla l(x, \lambda) = 0$. Thus by selecting $\alpha_k$ to minimize the merit function
in the search direction at each step, the process will converge to a point where
$\nabla l(x, \lambda) = 0$. However, there is no guarantee that $h(x) = 0$ at that point.

We can try to improve the method either by changing the way in which 
the direction is selected or by changing the merit function. In this case a slight 
modification of the merit function will work. Let 


$$w(x, \lambda, \gamma) = m(x, \lambda) - \gamma [f(x) + \lambda^T h(x)]$$

for some $\gamma > 0$. We then calculate that the gradient of $w$ has the two components
corresponding to $x$ and $\lambda$


$$\begin{aligned}
&\nabla l(x, \lambda)\, L(x, \lambda) + h(x)^T \nabla h(x) - \gamma \nabla l(x, \lambda) \\
&\nabla l(x, \lambda)\, \nabla h(x)^T - \gamma h(x)^T,
\end{aligned}$$

and hence the inner product of the gradient with the direction $-\nabla l(x, \lambda)^T$, $h(x)$ is

$$-\nabla l(x, \lambda) [L(x, \lambda) - \gamma I] \nabla l(x, \lambda)^T - \gamma |h(x)|^2.$$


Now since we are assuming that $L(x, \lambda)$ is positive definite in a compact region of
interest, there is a $\gamma > 0$ such that $L(x, \lambda) - \gamma I$ is positive definite in this region.
Then according to the above calculation, the direction $-\nabla l(x, \lambda)^T$, $h(x)$ is a descent
direction, and the standard descent method will converge to a solution. This method 
will not converge very rapidly however. (See Exercise 2 for further analysis of this 
method.) 
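
The first-order method with the modified merit function can be tried on a very small problem. The sketch below is an illustration under several assumptions that are not part of the text: the test problem is minimize $\tfrac{1}{2}(x_1^2 + x_2^2)$ subject to $x_1 + x_2 - 1 = 0$ (so $L = I$), the value $\gamma = 0.5$ is picked so that $L - \gamma I$ is positive definite, and a simple backtracking rule with a sufficient-decrease test stands in for the exact minimization of the merit function along the search direction.

```python
GAMMA = 0.5  # assumed value; any 0 < GAMMA < 1 keeps L - GAMMA*I positive definite here


def grad_l(x, lam):
    """Gradient with respect to x of l(x, lam) = f(x) + lam*h(x)."""
    return [x[0] + lam, x[1] + lam]


def h(x):
    """Equality constraint h(x) = x1 + x2 - 1."""
    return x[0] + x[1] - 1.0


def merit(x, lam):
    """Modified merit function w = m - GAMMA*(f + lam*h)."""
    g = grad_l(x, lam)
    m = 0.5 * (g[0] ** 2 + g[1] ** 2) + 0.5 * h(x) ** 2
    lagrangian = 0.5 * (x[0] ** 2 + x[1] ** 2) + lam * h(x)
    return m - GAMMA * lagrangian


def first_order_primal_dual(x, lam, iters=5000):
    """Iterate (13) with a backtracking step (an assumed substitute for exact line search)."""
    for _ in range(iters):
        g, r = grad_l(x, lam), h(x)
        # Directional derivative of w along (-g, r); uses L = I for this problem.
        slope = -(1.0 - GAMMA) * (g[0] ** 2 + g[1] ** 2) - GAMMA * r ** 2
        alpha, w0 = 1.0, merit(x, lam)
        while alpha > 1e-12:
            x_new = [x[0] - alpha * g[0], x[1] - alpha * g[1]]
            lam_new = lam + alpha * r
            if merit(x_new, lam_new) <= w0 + 0.25 * alpha * slope:
                x, lam = x_new, lam_new
                break
            alpha *= 0.5
    return x, lam


if __name__ == "__main__":
    x, lam = first_order_primal_dual([0.0, 0.0], 0.0)
    print("x =", x, " lam =", lam)
```

For this problem the solution of the Lagrange equations is $x = (\tfrac{1}{2}, \tfrac{1}{2})$, $\lambda = -\tfrac{1}{2}$. The iteration approaches it only linearly, in line with the remark that the method does not converge very rapidly.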


Conjugate Directions 


Consider the quadratic program 


$$\begin{array}{ll}
\text{minimize} & \tfrac{1}{2} x^T Q x - b^T x \\
\text{subject to} & Ax = c.
\end{array} \tag{16}$$

The first-order necessary conditions for this problem are

$$\begin{aligned}
Qx + A^T \lambda &= b \\
Ax &= c.
\end{aligned} \tag{17}$$


As discussed in the previous section, this problem is equivalent to solving a system
of linear equations whose coefficient matrix is


$$M = \begin{bmatrix} Q & A^T \\ A & 0 \end{bmatrix}. \tag{18}$$


This matrix is symmetric, but it is not positive definite (nor even semidefinite).
However, it is possible to formally generalize the conjugate gradient method to
systems of this type by just applying the conjugate-gradient formulae (17)-(20) of
Section 9.3 with $Q$ replaced by $M$. A difficulty is that singular directions (defined
as directions $p$ such that $p^T M p = 0$) may occur and cause the process to break down.
Procedures for overcoming this difficulty have been developed, however. Also, 
as in the ordinary conjugate gradient method, the approach can be generalized to 
treat nonquadratic problems as well. Overall, however, the application of conjugate 
direction methods to the Lagrange system of equations, although very promising, 
is not currently considered practical. 
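
As a small numerical illustration of the remarks above, the following sketch builds the matrix (18) for a two-variable problem with one constraint, confirms that its eigenvalues have both signs, solves the system (17) directly, and exhibits a singular direction. The data `Q`, `A`, `b`, `c` are assumptions chosen only for illustration; they do not come from the text.

```python
import numpy as np

# Hypothetical data (not from the text): positive definite Q, one constraint.
Q = np.array([[2.0, 0.0],
              [0.0, 1.0]])
A = np.array([[1.0, 1.0]])
b = np.array([1.0, 0.0])
c = np.array([1.0])
n, m = Q.shape[0], A.shape[0]

# Coefficient matrix (18) of the Lagrange system (17).
M = np.block([[Q, A.T],
              [A, np.zeros((m, m))]])

# M is symmetric, so eigvalsh applies; the spectrum has both signs.
eigenvalues = np.linalg.eigvalsh(M)

# Solve (17) directly: Q x + A^T lam = b, A x = c.
solution = np.linalg.solve(M, np.concatenate([b, c]))
x, lam = solution[:n], solution[n:]

# A singular direction in the sense of the text: p != 0 with p^T M p = 0.
p = np.array([0.0, 0.0, 1.0])
singular_value = p @ M @ p

print("eigenvalues of M:", eigenvalues)
print("x =", x, "lambda =", lam)
print("p^T M p =", singular_value)
```

Any direction that moves only the multiplier is singular here, because the lower-right block of M is zero; this is one way to see why a conjugate gradient recursion applied to M can break down.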
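
Newton’s Method

Newton’s method for solving systems of equations can be easily applied to the
Lagrange equations. In its most straightforward form, the method solves the system

$$\begin{aligned}
\nabla l(x, \lambda) &= 0 \\
h(x) &= 0
\end{aligned} \tag{19}$$

by solving the linearized version recursively. That is, given $x_k$, $\lambda_k$ the new point
$x_{k+1}$, $\lambda_{k+1}$ is determined from the equations

$$\begin{aligned}
\nabla l(x_k, \lambda_k)^T + L(x_k, \lambda_k) d_k + \nabla h(x_k)^T y_k &= 0 \\
h(x_k) + \nabla h(x_k) d_k &= 0
\end{aligned} \tag{20}$$

by setting $x_{k+1} = x_k + d_k$, $\lambda_{k+1} = \lambda_k + y_k$. In matrix form the above Newton
equations are

$$\begin{bmatrix} L(x_k, \lambda_k) & \nabla h(x_k)^T \\ \nabla h(x_k) & 0 \end{bmatrix}
\begin{bmatrix} d_k \\ y_k \end{bmatrix}
= \begin{bmatrix} -\nabla l(x_k, \lambda_k)^T \\ -h(x_k) \end{bmatrix}. \tag{21}$$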


The Newton equations have some important structural properties. First, we
observe that by adding $\nabla h(x_k)^T \lambda_k$ to the top equation, the system can be
transformed to the form

$$\begin{bmatrix} L(x_k, \lambda_k) & \nabla h(x_k)^T \\ \nabla h(x_k) & 0 \end{bmatrix}
\begin{bmatrix} d_k \\ \lambda_{k+1} \end{bmatrix}
= \begin{bmatrix} -\nabla f(x_k)^T \\ -h(x_k) \end{bmatrix}, \tag{22}$$

where again $\lambda_{k+1} = \lambda_k + y_k$. In this form $\lambda_k$ appears only in the matrix $L(x_k, \lambda_k)$.
This conversion between (21) and (22) will be useful later.

Next we note that the structure of the coefficient matrix of (21) or (22) is
identical to that of the Proposition of Section 15.1. The standard second-order
sufficiency conditions imply that $\nabla h(x^*)$ is of full rank and that $L(x^*, \lambda^*)$ is
positive definite on $M = \{x : \nabla h(x^*) x = 0\}$ at the solution. By continuity these
conditions can be assumed to hold in a region near the solution as well. Under
these assumptions it follows from Proposition 1 that the Newton equation (21) has
a unique solution.

It is again worthwhile to point out that, although the Hessian of the Lagrangian
need be positive definite only on the tangent subspace in order for the system (21)
to be nonsingular, it is possible to alter the original problem by incorporation of
a quadratic penalty term so that the new Hessian of the Lagrangian is
$L(x, \lambda) + c \nabla h(x)^T \nabla h(x)$. For sufficiently large $c$, this new Hessian will be positive definite
over the entire space.

If $L(x, \lambda)$ is positive definite (either originally or through the incorporation
of a penalty term), it is possible to write an explicit expression for the solution of
the system (21). Let us define $L_k = L(x_k, \lambda_k)$, $A_k = \nabla h(x_k)$, $l_k = \nabla l(x_k, \lambda_k)^T$,
$h_k = h(x_k)$. The system then takes the form

$$\begin{aligned}
L_k d_k + A_k^T y_k &= -l_k \\
A_k d_k &= -h_k.
\end{aligned} \tag{23}$$

The solution is readily found, as in (7) and (8) for quadratic programming, to be

$$y_k = (A_k L_k^{-1} A_k^T)^{-1} [h_k - A_k L_k^{-1} l_k] \tag{24}$$

$$d_k = -L_k^{-1} [I - A_k^T (A_k L_k^{-1} A_k^T)^{-1} A_k L_k^{-1}]\, l_k - L_k^{-1} A_k^T (A_k L_k^{-1} A_k^T)^{-1} h_k. \tag{25}$$


There are standard results concerning Newton’s method applied to a system 
of nonlinear equations that are applicable to the system (19). These results state 
that if the linearized system is nonsingular at the solution (as is implied by our 
assumptions) and if the initial point is sufficiently close to the solution, the method 
will in fact converge to the solution and the convergence will be of order at least two. 
To guarantee convergence from remote initial points and hence be more broadly 
applicable, it is desirable to use the method as a descent process. Fortunately, we 
can show that the direction generated by Newton’s method is a descent direction 
for the simple merit function 


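$$m(x, \lambda) = \tfrac{1}{2} |\nabla l(x, \lambda)|^2 + \tfrac{1}{2} |h(x)|^2.$$

Given $d_k$, $y_k$ satisfying (23), the inner product of this direction with the gradient of
$m$ at $x_k$, $\lambda_k$ is, referring to (15),

$$\begin{aligned}
[L_k l_k + A_k^T h_k,\ A_k l_k]^T [d_k, y_k] &= l_k^T L_k d_k + h_k^T A_k d_k + l_k^T A_k^T y_k \\
&= -|l_k|^2 - |h_k|^2.
\end{aligned}$$

This is strictly negative unless both $l_k = 0$ and $h_k = 0$. Thus Newton’s method has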
desirable global convergence properties when executed as a descent method with 
variable step size. 
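
The descent identity above can be checked numerically. The sketch below applies the Newton system (23) to a small test problem, compares the directional derivative of the merit function with the closed form $-|l_k|^2 - |h_k|^2$ at every iterate, and takes pure Newton steps (step size 1) rather than performing a line search. The test problem, the starting point, and all function names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical test problem (an assumption for illustration):
#   minimize x1 + x2  subject to  h(x) = x1^2 + x2^2 - 2 = 0,
# whose minimizer is x* = (-1, -1) with multiplier lambda* = 1/2.
def grad_f(x):
    return np.array([1.0, 1.0])

def h(x):
    return np.array([x[0]**2 + x[1]**2 - 2.0])

def jac_h(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]]])

def hess_lagrangian(x, lam):
    # f is linear, so L(x, lam) = lam * (Hessian of h) = 2 * lam * I.
    return 2.0 * lam[0] * np.eye(2)

def newton_step(x, lam):
    """Solve the Newton system (23) and return the step with diagnostics."""
    A = jac_h(x)
    L = hess_lagrangian(x, lam)
    l = grad_f(x) + A.T @ lam          # l_k, gradient of the Lagrangian
    hk = h(x)
    n, m = L.shape[0], A.shape[0]
    K = np.block([[L, A.T], [A, np.zeros((m, m))]])
    step = np.linalg.solve(K, -np.concatenate([l, hk]))
    # Gradient of the merit function m = 0.5|l|^2 + 0.5|h|^2 in (x, lam).
    grad_m = np.concatenate([L @ l + A.T @ hk, A @ l])
    slope = grad_m @ step              # directional derivative of m
    return step[:n], step[n:], slope, l, hk

x, lam = np.array([-1.5, -0.5]), np.array([1.0])
slope_gaps = []
for k in range(10):
    d, y, slope, l, hk = newton_step(x, lam)
    # Compare with the closed form -|l_k|^2 - |h_k|^2 derived in the text.
    slope_gaps.append(abs(slope + l @ l + hk @ hk))
    x, lam = x + d, lam + y            # pure Newton step (alpha_k = 1)

print("x =", x, "lambda =", lam)
print("largest slope mismatch:", max(slope_gaps))
```

Pure steps suffice here because the starting point is close to the solution; from a remote point the same direction would be combined with a search on the merit function, as the text describes.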

Note that the calculation above does not employ the explicit formulae (24)
and (25), and hence it is not necessary that $L(x, \lambda)$ be positive definite, as long as
the system (21) is invertible. We summarize the above discussion by the following
theorem.


Theorem. Define the Newton process by

$$\begin{aligned}
x_{k+1} &= x_k + \alpha_k d_k \\
\lambda_{k+1} &= \lambda_k + \alpha_k y_k,
\end{aligned}$$

where $d_k$, $y_k$ are solutions to (23) and where $\alpha_k$ is selected to minimize the
merit function

$$m(x, \lambda) = \tfrac{1}{2} |\nabla l(x, \lambda)|^2 + \tfrac{1}{2} |h(x)|^2.$$

Assume that $d_k$, $y_k$ exist and that the points generated lie in a compact set. Then
any limit point of these points satisfies the first-order necessary conditions for
a solution to the constrained minimization problem (1).


Proof. Most of this follows from the above observations and the Global Conver- 
gence Theorem. The one-dimensional search process is well-defined, since the merit 
function $m$ is bounded below. ∎


In view of this result, it is worth pursuing Newton’s method further. We would 
like to extend it to problems with inequality constraints. We would also like to 
avoid the necessity of evaluating L(x,, A,) at each step and to consider alternative 
merit functions—perhaps those that might distinguish a local maximum from a 
local minimum, which the simple merit function does not do. These considerations 
guide the developments of the next several sections. 


Relation to Quadratic Programming 


It is clear from the development of the preceding discussion that Newton’s method 
is closely related to quadratic programming with equality constraints. We explore 
this relationship more fully here, which will lead to a generalization of Newton’s 
method to problems with inequality constraints. 

Consider the problem

$$\begin{aligned}
\text{minimize} \quad & l_k^T d_k + \tfrac{1}{2} d_k^T L_k d_k \\
\text{subject to} \quad & A_k d_k + h_k = 0.
\end{aligned} \tag{26}$$

The first-order necessary conditions of this problem are exactly (21), or equivalently
(23), where $y_k$ corresponds to the Lagrange multiplier of (26). Thus, the solution
of (26) produces a Newton step.

Alternatively, we may consider the quadratic program

$$\begin{aligned}
\text{minimize} \quad & \nabla f(x_k) d_k + \tfrac{1}{2} d_k^T L_k d_k \\
\text{subject to} \quad & A_k d_k + h_k = 0.
\end{aligned} \tag{27}$$

The necessary conditions of this problem are exactly (22), where $\lambda_{k+1}$ now corresponds
to the Lagrange multiplier of (27). The program (27) is obtained from (26)
by merely subtracting $\lambda_k^T A_k d_k$ from the objective function; and this change has no
influence on $d_k$, since $A_k d_k$ is fixed.

The connection with quadratic programming suggests a procedure for extending 
Newton’s method to minimization problems with inequality constraints. Consider 
the problem 


$$\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = 0 \\
& g(x) \le 0.
\end{aligned}$$

Given an estimated solution point $x_k$ and estimated Lagrange multipliers $\lambda_k$, $\mu_k$,
one solves the quadratic program

$$\begin{aligned}
\text{minimize} \quad & \nabla f(x_k) d_k + \tfrac{1}{2} d_k^T L_k d_k \\
\text{subject to} \quad & \nabla h(x_k) d_k + h_k = 0 \\
& \nabla g(x_k) d_k + g_k \le 0,
\end{aligned} \tag{28}$$

where $L_k = F(x_k) + \lambda_k^T H(x_k) + \mu_k^T G(x_k)$, $h_k = h(x_k)$, $g_k = g(x_k)$. The new point is
determined by $x_{k+1} = x_k + d_k$, and the new Lagrange multipliers are the Lagrange
multipliers of the quadratic program (28). This is the essence of an early method for 
nonlinear programming termed SOLVER. It is a very attractive procedure, since it 
applies directly to problems with inequality as well as equality constraints without 
the use of an active set strategy (although such a strategy might be used to solve 
the required quadratic program). Methods of this general type, where a quadratic 
program is solved at each step, are referred to as recursive quadratic programming 
methods, and several variations are considered in this chapter. 

As presented here the recursive quadratic programming method extends 
Newton’s method to problems with inequality constraints, but the method has limita- 
tions. The quadratic program may not always be well-defined, the method requires 
second-order derivative information, and the simple merit function is not a descent 
function for the case of inequalities. Of these, the most serious is the requirement 
of second-order information, and this is addressed in the next section. 


15.5 MODIFIED NEWTON METHODS 


A modified Newton method is based on replacing the actual linearized system by 
an approximation. 
First, we concentrate on the equality constrained optimization problem 


$$\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = 0
\end{aligned} \tag{29}$$


in order to most clearly describe the relationships between the various approaches. 
Problems with inequality constraints can be treated within the equality constraint 
framework by an active set strategy or, in some cases, by recursive quadratic 
programming. 

The basic equations for Newton’s method can be written 


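$$\begin{bmatrix} x_{k+1} \\ \lambda_{k+1} \end{bmatrix}
= \begin{bmatrix} x_k \\ \lambda_k \end{bmatrix}
- \alpha_k \begin{bmatrix} L_k & A_k^T \\ A_k & 0 \end{bmatrix}^{-1}
\begin{bmatrix} l_k \\ h_k \end{bmatrix},$$

where as before $L_k$ is the Hessian of the Lagrangian, $A_k = \nabla h(x_k)$,
$l_k = [\nabla f(x_k) + \lambda_k^T \nabla h(x_k)]^T$, $h_k = h(x_k)$. A structured modified Newton method is a method of the
form

$$\begin{bmatrix} x_{k+1} \\ \lambda_{k+1} \end{bmatrix}
= \begin{bmatrix} x_k \\ \lambda_k \end{bmatrix}
- \alpha_k \begin{bmatrix} B_k & A_k^T \\ A_k & 0 \end{bmatrix}^{-1}
\begin{bmatrix} l_k \\ h_k \end{bmatrix}, \tag{30}$$

where $B_k$ is an approximation to $L_k$. The term “structured” derives from the fact that
only second-order information in the original system of equations is approximated;
the first-order information is kept intact.

Of course the method is implemented by solving the system

$$\begin{aligned}
B_k d_k + A_k^T y_k &= -l_k \\
A_k d_k &= -h_k
\end{aligned} \tag{31}$$

for $d_k$ and $y_k$ and then setting $x_{k+1} = x_k + \alpha_k d_k$, $\lambda_{k+1} = \lambda_k + \alpha_k y_k$ for some value
of $\alpha_k$. In this section we will not consider the procedure for selection of $\alpha_k$, and
thus for simplicity we take $\alpha_k = 1$. The simple transformation used earlier can be
applied to write (31) in the form

$$\begin{aligned}
B_k d_k + A_k^T \lambda_{k+1} &= -\nabla f(x_k)^T \\
A_k d_k &= -h_k.
\end{aligned} \tag{32}$$

Then $x_{k+1} = x_k + d_k$, and $\lambda_{k+1}$ is found directly as a solution to system (32).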

There are, of course, various ways to choose the approximation $B_k$. One is to
use a fixed, constant matrix throughout the iterative process. A second is to base
$B_k$ on some readily accessible information in $L(x_k, \lambda_k)$, such as setting $B_k$ equal to
the diagonal of $L(x_k, \lambda_k)$. Finally, a third possibility is to update $B_k$ using one of
the various quasi-Newton formulae.

One important advantage of the structured method is that $B_k$ can be taken to
be positive definite even though $L_k$ is not. If this is done, we can write the explicit
solution

$$y_k = (A_k B_k^{-1} A_k^T)^{-1} [h_k - A_k B_k^{-1} l_k] \tag{33}$$

$$d_k = -B_k^{-1} [I - A_k^T (A_k B_k^{-1} A_k^T)^{-1} A_k B_k^{-1}]\, l_k - B_k^{-1} A_k^T (A_k B_k^{-1} A_k^T)^{-1} h_k. \tag{34}$$
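
The explicit solution can be compared with a direct solve of the linear system for the structured modified Newton step. The sketch below does this for one set of data; the matrices `B`, `A` and the vectors `l`, `h` are hypothetical values chosen for illustration, standing in for $B_k$, $A_k$, $l_k$, $h_k$ at some iterate.

```python
import numpy as np

# Hypothetical data (assumptions for illustration): a positive definite B_k,
# a full-rank constraint Jacobian A_k, and arbitrary l_k, h_k.
B = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])
A = np.array([[1.0, 2.0, -1.0],
              [0.0, 1.0, 1.0]])
l = np.array([1.0, -2.0, 0.5])
h = np.array([0.3, -0.1])
n, m = B.shape[0], A.shape[0]

# Route 1: solve the linear system B d + A^T y = -l, A d = -h directly.
K = np.block([[B, A.T], [A, np.zeros((m, m))]])
step = np.linalg.solve(K, -np.concatenate([l, h]))
d_direct, y_direct = step[:n], step[n:]

# Route 2: the explicit formulas for y_k and d_k.
B_inv = np.linalg.inv(B)
S = A @ B_inv @ A.T                      # A_k B_k^{-1} A_k^T
S_inv = np.linalg.inv(S)
y_formula = S_inv @ (h - A @ B_inv @ l)
projector = np.eye(n) - A.T @ S_inv @ A @ B_inv
d_formula = -B_inv @ projector @ l - B_inv @ A.T @ S_inv @ h

print("d (direct):  ", d_direct)
print("d (formula): ", d_formula)
print("y (direct):  ", y_direct)
print("y (formula): ", y_formula)
```

In practice one would factor the system rather than form explicit inverses; the inverses appear here only to mirror the formulas term by term.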


Quadratic Programming 
Consider the quadratic program 


$$\begin{aligned}
\text{minimize} \quad & \nabla f(x_k) d_k + \tfrac{1}{2} d_k^T B_k d_k \\
\text{subject to} \quad & A_k d_k + h(x_k) = 0.
\end{aligned} \tag{35}$$

The first-order necessary conditions for this problem are

$$\begin{aligned}
B_k d_k + A_k^T \lambda_{k+1} + \nabla f(x_k)^T &= 0 \\
A_k d_k &= -h(x_k),
\end{aligned} \tag{36}$$

which are again identical to the system of equations of the structured modified
Newton method, in this case in the form (32). The Lagrange multiplier of the
quadratic program is $\lambda_{k+1}$. The equivalence of (35) and (36) leads to a recursive
quadratic programming method, where at each $x_k$ the quadratic program (35) is
solved to determine the direction $d_k$. In this case an arbitrary symmetric matrix $B_k$
is used in place of the Hessian of the Lagrangian. Note that the problem (35) does
not explicitly depend on $\lambda_k$; but $B_k$, often being chosen to approximate the Hessian
of the Lagrangian, may depend on $\lambda_k$.
As before, a principal advantage of the quadratic programming formulation 
is that there is an obvious extension to problems with inequality constraints: One 
simply employs a linearized version of the inequalities. 


15.6 DESCENT PROPERTIES 


In order to ensure convergence of the structured modified Newton methods of the 
previous section, it is necessary to find a suitable merit function—a merit function 
that is compatible with the direction-finding algorithm in the sense that it decreases 
along the direction generated. We must abandon the simple merit function at this 
point, since it is not compatible with these methods when $B_k \neq L_k$. However, two
other penalty functions considered earlier, the absolute-value exact penalty function 
and the quadratic penalty function, are compatible with the modified Newton 
approach. 


Absolute-Value Penalty Function 


Let us consider the constrained minimization problem 


$$\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & g(x) \le 0,
\end{aligned} \tag{37}$$


where g(x) is r-dimensional. For notational simplicity we consider the case of 
inequality constraints only, since it is, in fact, the most difficult case. The extension 
to equality constraints is straightforward. In accordance with the recursive quadratic 
programming approach, given a current point x, we select the direction of movement 
d by solving the quadratic programming problem 


$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2} d^T B d + \nabla f(x) d \\
\text{subject to} \quad & \nabla g(x) d + g(x) \le 0,
\end{aligned} \tag{38}$$

where $B$ is positive definite.

The first-order necessary conditions for a solution to this quadratic program
are

$$Bd + \nabla f(x)^T + \nabla g(x)^T \mu = 0 \tag{39a}$$

$$\nabla g(x) d + g(x) \le 0 \tag{39b}$$

$$\mu^T [\nabla g(x) d + g(x)] = 0 \tag{39c}$$

$$\mu \ge 0. \tag{39d}$$


Note that if the solution to the quadratic program has $d = 0$, then the point $x$,
together with $\mu$ from (39), satisfies the first-order necessary conditions for the
original minimization problem (37). The following proposition is the fundamental 
result concerning the compatibility of the absolute-value penalty function and the 
quadratic programming method for determining the direction of movement. 


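Proposition 1. Let $d$, $\mu$ (with $d \neq 0$) be a solution of the quadratic program
(38). Then if $c \ge \max_j(\mu_j)$, the vector $d$ is a descent direction for the penalty
function

$$P(x) = f(x) + c \sum_{j=1}^{r} g_j(x)^+.$$

Proof. Let $J(x) = \{j : g_j(x) > 0\}$. Now for $\alpha > 0$,

$$\begin{aligned}
P(x + \alpha d) &= f(x + \alpha d) + c \sum_{j=1}^{r} g_j(x + \alpha d)^+ \\
&= f(x) + \alpha \nabla f(x) d + c \sum_{j=1}^{r} [g_j(x) + \alpha \nabla g_j(x) d]^+ + o(\alpha) \\
&= f(x) + \alpha \nabla f(x) d + c \sum_{j=1}^{r} g_j(x)^+ + \alpha c \sum_{j \in J(x)} \nabla g_j(x) d + o(\alpha) \\
&= P(x) + \alpha \nabla f(x) d + \alpha c \sum_{j \in J(x)} \nabla g_j(x) d + o(\alpha),
\end{aligned} \tag{40}$$

where (39b) was used in the third line to infer that $\nabla g_j(x) d \le 0$ if $g_j(x) = 0$. Again
using (39b) we have

$$c \sum_{j \in J(x)} \nabla g_j(x) d \le c \sum_{j \in J(x)} -g_j(x) = -c \sum_{j=1}^{r} g_j(x)^+. \tag{41}$$

Using (39a) we have

$$\nabla f(x) d = -d^T B d - \sum_{j=1}^{r} \mu_j \nabla g_j(x) d,$$

which by using the complementary slackness condition (39c) leads to

$$\begin{aligned}
\nabla f(x) d &= -d^T B d + \sum_{j=1}^{r} \mu_j g_j(x) \le -d^T B d + \sum_{j=1}^{r} \mu_j g_j(x)^+ \\
&\le -d^T B d + \max_j(\mu_j) \sum_{j=1}^{r} g_j(x)^+.
\end{aligned} \tag{42}$$

Finally, substituting (41) and (42) in (40), we find

$$P(x + \alpha d) \le P(x) + \alpha \Bigl\{ -d^T B d - [c - \max_j(\mu_j)] \sum_{j=1}^{r} g_j(x)^+ \Bigr\} + o(\alpha).$$

Since $B$ is positive definite and $c \ge \max_j(\mu_j)$, it follows that for $\alpha$ sufficiently small,
$P(x + \alpha d) < P(x)$. ∎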
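
The role of the condition on $c$ in Proposition 1 can be seen on a small example. The sketch below solves the quadratic program (38) in closed form for a problem with a single inequality constraint (so the conditions (39) reduce to two cases), then evaluates the absolute-value penalty function a short distance along $d$ for one value of $c$ above the multiplier and one below it. The test problem, the point `x`, the choice $B = I$, and the helper names are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical problem (an assumption for illustration):
#   minimize (x1 - 2)^2 + (x2 - 1)^2  subject to  g(x) = x1^2 + x2^2 - 1 <= 0.
def f(x):
    return (x[0] - 2.0)**2 + (x[1] - 1.0)**2

def grad_f(x):
    return np.array([2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)])

def g(x):
    return x[0]**2 + x[1]**2 - 1.0

def grad_g(x):
    return np.array([2.0 * x[0], 2.0 * x[1]])

def qp_direction(x, B):
    """Solve (38) for one inequality constraint using conditions (39)."""
    B_inv = np.linalg.inv(B)
    gf, a, gx = grad_f(x), grad_g(x), g(x)
    d = -B_inv @ gf                     # candidate with mu = 0
    if a @ d + gx <= 0.0:
        return d, 0.0
    # Otherwise the linearized constraint is active: a d = -g(x), mu > 0.
    mu = (gx - a @ B_inv @ gf) / (a @ B_inv @ a)
    return -B_inv @ (gf + mu * a), mu

def penalty(x, c):
    """Absolute-value penalty function P(x) = f(x) + c * g(x)^+."""
    return f(x) + c * max(g(x), 0.0)

x = np.array([1.5, 0.5])                # infeasible point, g(x) = 1.5
B = np.eye(2)
d, mu = qp_direction(x, B)

alpha = 1e-3
c_large = 1.0                           # satisfies c >= mu
c_small = 0.1                           # violates c >= mu
change_large = penalty(x + alpha * d, c_large) - penalty(x, c_large)
change_small = penalty(x + alpha * d, c_small) - penalty(x, c_small)

print("d =", d, "mu =", mu)
print("change in P with c = 1.0:", change_large)
print("change in P with c = 0.1:", change_small)
```

In this instance the penalty function decreases along $d$ when $c$ exceeds the multiplier and increases when $c$ is too small, which is consistent with the hypothesis of the proposition.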


The above proposition is exceedingly important, for it provides a basis for estab- 
lishing the global convergence of modified Newton methods, including recursive 
quadratic programming. The following is a simple global convergence result based 
on the descent property. 


Theorem. Let $B$ be positive definite and assume that throughout some compact
region $\Omega \subset E^n$, the quadratic program (38) has a unique solution $d$, $\mu$ such that
at each point the Lagrange multipliers satisfy $\max_j(\mu_j) \le c$. Let the sequence
$\{x_k\}$ be generated by

$$x_{k+1} = x_k + \alpha_k d_k,$$

where $d_k$ is the solution to (38) at $x_k$ and where $\alpha_k$ minimizes $P(x_{k+1})$. Assume
that each $x_k \in \Omega$. Then every limit point $\bar{x}$ of $\{x_k\}$ satisfies the first-order
necessary conditions for the constrained minimization problem (37).


Proof. The solution to a quadratic program depends continuously on the data,
and hence the direction determined by the quadratic program (38) is a continuous
function of $x$. The function $P(x)$ is also continuous, and by Proposition 1, it follows
that $P$ is a descent function at every point that does not satisfy the first-order
conditions. The result thus follows from the Global Convergence Theorem. ∎


In view of the above result, recursive quadratic programming in conjunction 
with the absolute-value penalty function is an attractive technique. There are, 
however, some difficulties to be kept in mind. First, the selection of the parameter
$\alpha_k$ requires a one-dimensional search with respect to a nondifferentiable function.
Thus the efficient curve-fitting search methods of Chapter 8 cannot be used without
significant modification. Second, use of the absolute-value function requires an
estimate of an upper bound for the $\mu_j$’s, so that $c$ can be selected properly. In some
applications a suitable bound can be obtained from previous experience, but in
general one must develop a method for revising the estimate upward when necessary.

Another potential difficulty with the quadratic programming approach above
is that the quadratic program (38) may be infeasible at some point $x_k$, even though
the original problem (37) is feasible. If this happens, the method breaks down.
However, see Exercise 8 for a method that avoids this problem.


The Quadratic Penalty Function 


Another penalty function that is compatible with the modified Newton method 
approach is the standard quadratic penalty function. It has the added technical 
advantage that, since this penalty function is differentiable, it is possible to apply 
our earlier analytical principles to study the rate of convergence of the method. 
This leads to an analytical comparison of primal-dual methods with the methods of 
other chapters. 

We shall restrict attention to the problem with equality constraints, since that is 
all that is required for a rate of convergence analysis. The method can be extended 
to problems with inequality constraints either directly or by an active set method. 
Thus we consider the problem 


$$\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & h(x) = 0
\end{aligned} \tag{43}$$

and the standard quadratic penalty objective

$$P(x) = f(x) + \tfrac{1}{2} c |h(x)|^2. \tag{44}$$

From the theory in Chapter 13, we know that minimization of the objective with
a quadratic penalty function will not yield an exact solution to (43). In fact, the
minimum of the penalty function (44) will have $c h(x) \simeq \lambda$, where $\lambda$ is the Lagrange
multiplier of (43). Therefore, it seems appropriate in this case to consider the
quadratic programming problem

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2} d^T B d + \nabla f(x) d \\
\text{subject to} \quad & \nabla h(x) d + h(x) = \lambda / c,
\end{aligned} \tag{45}$$

where $\lambda$ is an estimate of the Lagrange multiplier of the original problem. A
particularly good choice is

$$\lambda = [(1/c) I + Q]^{-1} [h(x) - A B^{-1} \nabla f(x)^T], \tag{46}$$

where $A = \nabla h(x)$, $Q = A B^{-1} A^T$, which is the Lagrange multiplier that would
be obtained by the quadratic program with the penalty method. The proposed
method requires that $\lambda$ be first estimated from (46) and then used in the quadratic
programming problem (45).

The following proposition shows that this procedure produces a descent
direction for the quadratic penalty objective.


Proposition 2. For any $c > 0$, let $d$, $\lambda$ (with $d \neq 0$) be a solution to the
quadratic program (45). Then $d$ is a descent direction of the function
$P(x) = f(x) + (1/2) c |h(x)|^2$.


Proof. We have from the constraint equation

$$Ad = (1/c)\lambda - h(x),$$

which yields

$$c A^T A d = A^T \lambda - c A^T h(x).$$

Solving the necessary conditions for (45) yields (see the top part of (9) for a similar
expression with $Q = B$ there)

$$Bd = A^T Q^{-1} [A B^{-1} \nabla f(x)^T + (1/c)\lambda - h(x)] - \nabla f(x)^T.$$

Therefore,

$$\begin{aligned}
(B + cA^TA) d &= A^T Q^{-1} [A B^{-1} \nabla f(x)^T - h(x)] + A^T [(1/c) Q^{-1} + I] \lambda - \nabla f(x)^T - c A^T h(x) \\
&= A^T Q^{-1} \{ A B^{-1} \nabla f(x)^T - h(x) + ((1/c) I + Q) \lambda \} - \nabla f(x)^T - c A^T h(x) \\
&= -\nabla f(x)^T - c A^T h(x) = -\nabla P(x)^T.
\end{aligned}$$

The matrix $(B + cA^TA)$ is positive definite for any $c > 0$. It follows that
$\nabla P(x) d < 0$. ∎
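
The chain of identities in the proof can be verified numerically at a single point. The sketch below forms the multiplier estimate (46), solves the quadratic program (45) through its first-order conditions, and checks that the resulting $d$ satisfies $(B + cA^TA)d = -\nabla P(x)^T$ and has negative slope. All numerical data (`B`, `A`, `grad_f`, `hx`, `c`) are hypothetical values standing in for quantities evaluated at some point $x$.

```python
import numpy as np

# Hypothetical data at a point x (assumptions for illustration).
B = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])          # positive definite
A = np.array([[1.0, 2.0, -1.0],
              [0.0, 1.0, 1.0]])          # A = grad h(x), full row rank
grad_f = np.array([1.0, -2.0, 0.5])      # grad f(x)^T
hx = np.array([0.3, -0.1])               # h(x)
c = 10.0
n, m = B.shape[0], A.shape[0]

# Multiplier estimate (46).
B_inv = np.linalg.inv(B)
Q = A @ B_inv @ A.T
lam = np.linalg.solve(np.eye(m) / c + Q, hx - A @ B_inv @ grad_f)

# Solve the quadratic program (45) through its first-order conditions:
#   B d + grad_f + A^T nu = 0,   A d = lam / c - h(x).
K = np.block([[B, A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([-grad_f, lam / c - hx])
sol = np.linalg.solve(K, rhs)
d, nu = sol[:n], sol[n:]

# Gradient of P(x) = f(x) + (c/2)|h(x)|^2 and the identity from the proof.
grad_P = grad_f + c * (A.T @ hx)
residual = (B + c * (A.T @ A)) @ d + grad_P
slope = grad_P @ d

print("residual of (B + c A^T A) d = -grad P:", residual)
print("QP multiplier:", nu, " estimate (46):", lam)
print("slope of P along d:", slope)
```

The multiplier `nu` returned by the quadratic program coincides with the estimate from (46), which is the sense in which (46) is the multiplier the quadratic program itself would produce.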


15.7 RATE OF CONVERGENCE 


It is now appropriate to apply the principles of convergence analysis that have been 
repeatedly emphasized in previous chapters to the recursive quadratic programming 
approach. We expect that, if this new approach is well founded, then the rate of 
convergence of the algorithm should be related to the familiar canonical rate, which 
we have learned is a fundamental measure of the complexity of the problem. If 
it is not so related, then some modification of the algorithm is probably required. 
Indeed, we shall find that a small but important modification is required. 
From the proof of Proposition 2 of Section 15.6, we have the formula

$$(B + cA^TA) d = -\nabla P(x)^T,$$

which can be written as

$$d = -(B + cA^TA)^{-1} \nabla P(x)^T.$$

This shows that the method is a modified Newton method applied to the unconstrained
minimization of $P(x)$. From the Modified Newton Method Theorem of
Section 10.1, we see immediately that the rate of convergence is determined by the
eigenvalues of the matrix that is the product of the coefficient matrix $(B + cA^TA)^{-1}$
and the Hessian of the function $P$ at the solution point. The Hessian of $P$ is
$(L + cA^TA)$, where $L = F(x) + c h(x)^T H(x)$. We know that the vector $c h(x)$ at
the solution of the penalty problem is equal to $\lambda_c$, where $\nabla f(x) + \lambda_c^T \nabla h(x) = 0$.
Therefore, the rate of convergence is determined by the eigenvalues of

$$(B + cA^TA)^{-1} (L + cA^TA), \tag{47}$$

where all quantities are evaluated at the solution to the penalty problem and
$L = F + \lambda_c^T H$. For large values of $c$, all quantities are approximately equal to the values
at the optimal solution to the constrained problem.

Now what we wish to show is that as $c \to \infty$, the matrix (47) looks like $B_M^{-1} L_M$
on the subspace $M$, and like the identity matrix on $M^{\perp}$, the subspace orthogonal
to $M$. To do this in detail, let $C$ be an $n \times (n - m)$ matrix whose columns form an
orthonormal basis for $M$, the tangent subspace $\{x : Ax = 0\}$. Let $D = A^T (AA^T)^{-1}$.
Then $AC = 0$, $AD = I$, $C^T C = I$, $C^T D = 0$.

The eigenvalues of $(B + cA^TA)^{-1}(L + cA^TA)$ are equal to those of

$$\begin{aligned}
&[C, D]^{-1} (B + cA^TA)^{-1} \{[C, D]^T\}^{-1} [C, D]^T (L + cA^TA) [C, D] \\
&\qquad = \begin{bmatrix} C^T B C & C^T B D \\ D^T B C & D^T B D + cI \end{bmatrix}^{-1}
\begin{bmatrix} C^T L C & C^T L D \\ D^T L C & D^T L D + cI \end{bmatrix}.
\end{aligned}$$

Now as $c \to \infty$, the matrix above approaches

$$\begin{bmatrix} B_M^{-1} L_M & B_M^{-1} C^T (L - B) D \\ 0 & I \end{bmatrix},$$

where $B_M = C^T B C$, $L_M = C^T L C$ (see Exercise 6). The eigenvalues of this matrix
are those of $B_M^{-1} L_M$ together with those of $I$. This analysis leads directly to the
following conclusion:


Theorem. Let $a$, $A$ be the smallest and largest eigenvalues, respectively,
of $B_M^{-1} L_M$ and assume that $a \le 1 \le A$. Then the structured modified Newton
method with quadratic penalty function has a rate of convergence no greater
than $[(A - a)/(A + a)]^2$ as $c \to \infty$.
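
The limiting behavior of the eigenvalues can be observed numerically. The sketch below compares the eigenvalues of $(B + cA^TA)^{-1}(L + cA^TA)$ for a large $c$ with those of $B_M^{-1} L_M$ augmented by unit eigenvalues, and evaluates the rate bound of the theorem. The matrices are hypothetical and were chosen so that the restricted eigenvalues are 0.5 and 3; the helper `generalized_eigenvalues` and the use of an SVD to build the basis $C$ are implementation choices of this sketch, not part of the text.

```python
import numpy as np

def generalized_eigenvalues(B_mat, L_mat):
    """Eigenvalues of B_mat^{-1} L_mat (L_mat symmetric, B_mat positive definite)."""
    G = np.linalg.cholesky(B_mat)        # B_mat = G G^T
    G_inv = np.linalg.inv(G)
    # G^{-1} L G^{-T} is symmetric and similar to B^{-1} L.
    return np.linalg.eigvalsh(G_inv @ L_mat @ G_inv.T)

# Hypothetical data (assumptions for illustration).
B = np.diag([1.0, 2.0, 5.0])
L = np.array([[0.5, 0.0, 1.0],
              [0.0, 6.0, 1.0],
              [1.0, 1.0, 7.0]])
A = np.array([[0.0, 0.0, 1.0]])
n, m = B.shape[0], A.shape[0]

# Orthonormal basis C of the tangent subspace {x : A x = 0} from the SVD of A.
_, _, Vt = np.linalg.svd(A)
C = Vt[m:].T
B_M, L_M = C.T @ B @ C, C.T @ L @ C
reduced = generalized_eigenvalues(B_M, L_M)

# Eigenvalues of (B + c A^T A)^{-1} (L + c A^T A) for a large penalty c.
c = 1e7
full = generalized_eigenvalues(B + c * (A.T @ A), L + c * (A.T @ A))

a, A_max = reduced.min(), reduced.max()
rate = ((A_max - a) / (A_max + a))**2

print("eigenvalues on the tangent subspace:", reduced)
print("eigenvalues of the full matrix:     ", full)
print("rate bound:", rate)
```

With these data the restricted eigenvalues straddle unity, so the hypothesis of the theorem holds and the bound evaluates to $(2.5/3.5)^2$.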


In the special case of $B = I$, the rate in the above theorem is precisely
the canonical rate, defined by the eigenvalues of $L$ restricted to the tangent plane.
It is important to note, however, that in order for the rate of the theorem to be
achieved, the eigenvalues of $B_M^{-1} L_M$ must be spread around unity; if not, the rate
will be poorer. Thus, even if $L_M$ is well-conditioned, but the eigenvalues differ
greatly from unity, the choice $B = I$ may be poor. This is an instance where proper
scaling is vital. (We also point out that the above analysis is closely related to that
of Section 13.4, where a similar conclusion is obtained.)

Fig. 15.1 Decomposition of the direction d

There is a geometric explanation for the scaling property. Take B =I for 
simplicity. Then the direction of movement $d$ is $d = -\nabla f(x)^T + A^T \lambda$ for some $\lambda$.
Using the fact that the projected gradient is $p = \nabla f(x)^T + A^T \mu$ for some $\mu$, we see
that $d = -p + A^T(\lambda + \mu)$. Thus $d$ can be decomposed into two components: one in
the direction of the projected negative gradient, the other in a direction orthogonal to 
the tangent plane (see Fig. 15.1). Ideally, these two components should be in proper 
proportions so that the constraint surface is reached at the same point as would be 
reached by minimization in the direction of the projected negative gradient. If they 
are not, convergence will be poor. 


15.8 INTERIOR POINT METHODS 


The primal—dual interior-point methods discussed for linear programming in 
Chapter 5 are, as mentioned there, closely related to the barrier methods presented
in Chapter 13 and the primal—dual methods of the current chapter. They can be 
naturally extended to solve nonlinear programming problems while maintaining 
both theoretical and practical efficiency. 

Consider the inequality constrained problem 


$$\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{subject to} \quad & Ax = b, \\
& g(x) \le 0.
\end{aligned} \tag{48}$$

In general, a weakness of the active constraint method for such a problem is the
combinatorial nature of determining which constraints should be active.


Logarithmic Barrier Method 


A method that avoids the necessity to explicitly select a set of active constraints 
is based on the logarithmic barrier method, which solves a sequence of equality 
constrained minimization problems. Specifically, 


$$\begin{aligned}
\text{minimize} \quad & f(x) - \mu \sum_{j} \log(-g_j(x)) \\
\text{subject to} \quad & Ax = b,
\end{aligned} \tag{49}$$

where $\mu = \mu^k > 0$, $k = 1, \ldots$, $\mu^k > \mu^{k+1}$, $\mu^k \to 0$. The $\mu^k$’s can be pre-determined.
Typically, we have $\mu^{k+1} = \gamma \mu^k$ for some constant $0 < \gamma < 1$. Here, we also assume
that the original problem has a feasible interior-point $x^0$; that is,

$$Ax^0 = b \quad \text{and} \quad g(x^0) < 0,$$

and $A$ has full row rank.

For fixed $\mu$, and using $s_j = -\mu / g_j(x)$, the optimality conditions of the barrier
problem (49) are:

$$\begin{aligned}
-S g(x) &= \mu \mathbf{1} \\
Ax &= b \\
-A^T y + \nabla f(x)^T + \nabla g(x)^T s &= 0,
\end{aligned} \tag{50}$$

where $S = \operatorname{diag}(s)$; that is, a diagonal matrix whose diagonal entries are $s$, and
$\nabla g(x)$ is the Jacobian matrix of $g(x)$.

If $f(x)$ and $g_i(x)$ are convex functions for all $i$, $f(x) - \mu \sum_j \log(-g_j(x))$ is
strictly convex in the interior of the feasible region, and the objective level set is
bounded, then there is a unique minimizer for the barrier problem. Let
$(x(\mu), y(\mu), s(\mu) > 0)$ be the (unique) solution of (50). Then, these values form the
primal-dual central path of (48):

$$\mathcal{C} = \{(x(\mu), y(\mu), s(\mu) > 0) : 0 < \mu < \infty\}.$$

This can be summarized in the following theorem.


Theorem 1. Let $(x(\mu), y(\mu), s(\mu))$ be on the central path.


i) If $f(x)$ and $g_i(x)$ are convex functions for all $i$, then $s(\mu)$ is unique.

ii) Furthermore, if $f(x) - \mu \sum_j \log(-g_j(x))$ is strictly convex,
$(x(\mu), y(\mu), s(\mu))$ are unique, and they are bounded for $0 < \mu \le \mu^0$ for
any given $\mu^0 > 0$.

iii) For $0 < \mu' < \mu$, $f(x(\mu')) < f(x(\mu))$ if $x(\mu') \neq x(\mu)$.

iv) $(x(\mu), y(\mu), s(\mu))$ converges to a point satisfying the first-order necessary
conditions for a solution of (48) as $\mu \to 0$.
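
A one-variable instance makes the central path concrete. In the sketch below there are no equality constraints, so the conditions (50) reduce to two scalar equations that can be solved in closed form; the problem itself and the list of barrier parameters are assumptions chosen for illustration.

```python
import math

# Hypothetical one-variable instance of (48) with no equality constraints
# (an assumption for illustration):
#   minimize f(x) = 0.5 * (x + 1)^2  subject to  g(x) = -x <= 0.
def f(x):
    return 0.5 * (x + 1.0)**2

def central_path_point(mu):
    """Solve the conditions (50) here: s * x = mu and (x + 1) - s = 0."""
    x = 0.5 * (-1.0 + math.sqrt(1.0 + 4.0 * mu))   # positive root of x^2 + x = mu
    s = x + 1.0
    return x, s

mus = [1.0, 0.1, 0.01, 0.001, 1e-6]
path = [central_path_point(mu) for mu in mus]
objective_values = [f(x) for x, _ in path]

for mu, (x, s) in zip(mus, path):
    print(f"mu = {mu:g}: x = {x:.6f}, s = {s:.6f}, -s*g(x) = {s * x:.6g}")
```

As the barrier parameter decreases, the objective value along the path decreases and the pair $(x, s)$ approaches $(0, 1)$, which satisfies the first-order conditions of this small problem.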




Once we have an approximate solution point $(x, y, s) = (x_k, y_k, s_k)$ for (50)
for $\mu = \mu^k > 0$, we can again use the primal-dual methods described for linear
programming to generate a new approximate solution to (50) for $\mu = \mu^{k+1} < \mu^k$.
The Newton direction $(d_x, d_y, d_s)$ is found from the system of linear equations:

$$\begin{aligned}
-S \nabla g(x) d_x - G(x) d_s &= \mu \mathbf{1} + S g(x), \\
A d_x &= b - Ax, \\
-A^T d_y + \Bigl( \nabla^2 f(x) + \sum_{j} s_j \nabla^2 g_j(x) \Bigr) d_x + \nabla g(x)^T d_s &= A^T y - \nabla f(x)^T - \nabla g(x)^T s,
\end{aligned} \tag{51}$$

where $G(x) = \operatorname{diag}(g(x))$.

Recently, this approach has also been used to find points satisfying the first- 
order conditions for problems when f(x) and g;(x) are not generally convex 
functions. 


Quadratic Programming 
Let $f(x) = (1/2) x^T Q x + c^T x$ and $g_i(x) = -x_i$ for $i = 1, \ldots, n$, and consider the
quadratic program

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2} x^T Q x + c^T x \\
\text{subject to} \quad & Ax = b, \\
& x \ge 0,
\end{aligned} \tag{52}$$

where the given matrix $Q \in E^{n \times n}$ is positive semidefinite (that is, the objective is
a convex function), $A \in E^{m \times n}$, $c \in E^n$ and $b \in E^m$. The problem reduces to finding
$x \in E^n$, $y \in E^m$ and $s \in E^n$ satisfying the following optimality conditions:

$$\begin{aligned}
Sx &= 0 \\
Ax &= b \\
-A^T y + Qx - s &= -c \\
(x, s) &\ge 0.
\end{aligned} \tag{53}$$

The optimality conditions with the logarithmic barrier function with parameter $\mu$
are:

$$\begin{aligned}
Sx &= \mu \mathbf{1} \\
Ax &= b \\
-A^T y + Qx - s &= -c.
\end{aligned} \tag{54}$$

Note that the bottom two sets of constraints are linear equalities.

Thus, once we have an interior feasible point $(x, y, s)$ for (54), with $\mu = x^T s / n$,
we can apply Newton’s method to compute a new (approximate) iterate $(x^+, y^+, s^+)$
by solving for $(d_x, d_y, d_s)$ from the system of linear equations:

$$\begin{aligned}
S d_x + X d_s &= \gamma \mu \mathbf{1} - Xs, \\
A d_x &= 0, \\
-A^T d_y + Q d_x - d_s &= 0,
\end{aligned} \tag{55}$$

where $X$ and $S$ are two diagonal matrices whose diagonal entries are $x > 0$ and
$s > 0$, respectively. Here, $\gamma$ is a fixed positive constant less than 1, which implies
that our targeted $\mu$ is reduced by the factor $\gamma$ at each step.


Potential Function 


For any interior feasible point $(x, y, s)$ of (52) and its dual, a suitable merit function
is the potential function introduced in Chapter 5 for linear programming:

$$\psi_{n+\rho}(x, s) = (n + \rho) \log(x^T s) - \sum_{j=1}^{n} \log(x_j s_j).$$

The main result for this is stated in the following theorem.


Theorem 2. In solving (55) for $(d_x, d_y, d_s)$, let $\gamma = n/(n + \rho) < 1$ for fixed
$\rho \ge \sqrt{n}$ and assign $x^+ = x + \alpha d_x$, $y^+ = y + \alpha d_y$, and $s^+ = s + \alpha d_s$ where

$$\alpha = \frac{\bar{\alpha} \sqrt{\min(Xs)}}{\left| (XS)^{-1/2} \left( \dfrac{x^T s}{n + \rho} \mathbf{1} - Xs \right) \right|},$$

where $\bar{\alpha}$ is any positive constant less than 1. (Again $X$ and $S$ are matrices with
components on the diagonal being those of $x$ and $s$, respectively.) Then,

$$\psi_{n+\rho}(x^+, s^+) - \psi_{n+\rho}(x, s) \le -\bar{\alpha} \sqrt{3/4} + \frac{\bar{\alpha}^2}{2(1 - \bar{\alpha})}.$$


The proof of the theorem is also similar to that for linear programming; see
Exercise 12. Notice that, since $Q$ is positive semidefinite, we have

$$d_x^T d_s = (d_x, d_y)^T (d_s, 0) = d_x^T Q d_x \ge 0$$

while $d_x^T d_s = 0$ in the linear programming case.

We outline the algorithm here:


Given any interior feasible $(x_0, y_0, s_0)$ of (52) and its dual. Set $\rho \ge \sqrt{n}$ and
$k = 0$.

1. Set $(x, s) = (x_k, s_k)$ and $\gamma = n/(n + \rho)$ and compute $(d_x, d_y, d_s)$ from (55).
2. Let $x_{k+1} = x_k + \bar{\alpha} d_x$, $y_{k+1} = y_k + \bar{\alpha} d_y$, and $s_{k+1} = s_k + \bar{\alpha} d_s$ where

$$\bar{\alpha} = \arg\min_{\alpha \ge 0} \psi_{n+\rho}(x_k + \alpha d_x, s_k + \alpha d_s).$$

3. Let $k = k + 1$. If $s_k^T x_k / s_0^T x_0 \le \varepsilon$, stop. Otherwise, return to Step 1.


This algorithm exhibits an iteration complexity bound that is identical to that of 
linear programming expressed in Theorem 2, Section 5.6. 
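
The outline above can be sketched in a few lines of NumPy. The version below departs from Step 2 in one respect, which should be kept in mind: instead of an exact minimization of the potential function it uses the fixed step size of Theorem 2 with $\bar{\alpha} = 0.5$ and $\rho = n$. The problem data, the starting point, the tolerance, and the function names are assumptions made for illustration.

```python
import numpy as np

def potential(x, s, rho):
    """Potential function psi_{n+rho}(x, s) from the text."""
    n = x.size
    return (n + rho) * np.log(x @ s) - np.sum(np.log(x * s))

def newton_direction(Q, A, x, s, rho):
    """Solve the linear system (55) with gamma = n / (n + rho)."""
    n, m = x.size, A.shape[0]
    target = (x @ s) / (n + rho)         # gamma * mu with mu = x^T s / n
    K = np.block([
        [np.diag(s), np.zeros((n, m)), np.diag(x)],
        [A, np.zeros((m, m)), np.zeros((m, n))],
        [Q, -A.T, -np.eye(n)],
    ])
    rhs = np.concatenate([target - x * s, np.zeros(m), np.zeros(n)])
    step = np.linalg.solve(K, rhs)
    return step[:n], step[n:n + m], step[n + m:], rhs[:n]

# Hypothetical data (assumptions for illustration); Q is positive semidefinite.
Q = np.diag([2.0, 1.0, 0.0])
c = np.array([-2.0, -1.0, 1.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

# Interior feasible start: A x = b, x > 0, s = Q x + c - A^T y > 0.
x = np.full(3, 1.0 / 3.0)
y = np.array([-2.0])
s = Q @ x + c - A.T @ y
n = x.size
rho = float(n)                           # any rho >= sqrt(n)
alpha_bar = 0.5                          # constant of Theorem 2
eps = 1e-8

gap0 = x @ s
history = [potential(x, s, rho)]
for k in range(2000):
    if (x @ s) / gap0 <= eps:
        break
    dx, dy, ds, r = newton_direction(Q, A, x, s, rho)
    # Fixed step of Theorem 2, used here in place of the exact line search.
    alpha = alpha_bar * np.sqrt(np.min(x * s)) / np.linalg.norm(r / np.sqrt(x * s))
    x, y, s = x + alpha * dx, y + alpha * dy, s + alpha * ds
    history.append(potential(x, s, rho))

print("iterations:", len(history) - 1)
print("x =", x, "relative gap =", (x @ s) / gap0)
```

For this instance the minimizer can be found by hand, $x^* = (2/3, 1/3, 0)$, so the final iterate can be compared against it. An exact or approximate line search on the potential function, as in Step 2, would generally take longer steps than the conservative fixed step used here.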


15.9 SEMIDEFINITE PROGRAMMING 


Semidefinite programming (SDP) is a natural extension of linear programming. In 
linear programming, the variables form a vector which is required to be component- 
wise nonnegative, while in semidefinite programming the variables are compo- 
nents of a symmetric matrix constrained to be positive semidefinite. Both types 
of problems may have linear equality constraints as well. Although semidef- 
inite programs have long been known to be convex optimization problems, no 
efficient solution algorithm was known until, during the past decade or so, it 
was discovered that interior-point algorithms for linear programming discussed in 
Chapter 5, can be adapted to solve semidefinite programs with both theoretical and 
practical efficiency. During the same period, it was discovered that the semidefinite 
programming framework is representative of a wide assortment of applications, 
including combinatorial optimization, statistical computation, robust optimization, 
Euclidean distance geometry, quantum computing, and optimal control. Semidef- 
inite programming is now widely recognized as a powerful model of general 
importance. 

Suppose $A$ and $B$ are $m \times n$ matrices. We define $A \bullet B = \operatorname{trace}(A^T B) =
\sum_{i,j} a_{ij} b_{ij}$. In semidefinite programming, this definition is almost always used for
the case where the matrices are both square and symmetric. 
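
The two expressions for the inner product can be compared directly. The following sketch evaluates both for a pair of small symmetric matrices and also checks the special case of a rank-one matrix $X = vv^T$, for which $A \bullet X = v^T A v$; the matrices and the vector are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical symmetric matrices (assumptions for illustration).
A = np.array([[1.0, 2.0],
              [2.0, -1.0]])
B = np.array([[0.5, 0.0],
              [0.0, 3.0]])

inner_trace = np.trace(A.T @ B)          # trace(A^T B)
inner_sum = np.sum(A * B)                # sum over i, j of a_ij * b_ij

# Rank-one case: for X = v v^T the inner product reduces to v^T A v.
v = np.array([1.0, -1.0])
X = np.outer(v, v)
inner_rank_one = np.sum(A * X)
quadratic_form = v @ A @ v

print(inner_trace, inner_sum)
print(inner_rank_one, quadratic_form)
```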

Now let $C$ and $A_i$, $i = 1, 2, \ldots, m$, be given $n$-dimensional symmetric matrices
and $b \in E^m$. And let $X$ be an unknown $n$-dimensional symmetric matrix. Then, the
primal semidefinite programming problem is

$$\begin{aligned}
(\text{SDP}) \quad \text{minimize} \quad & C \bullet X \\
\text{subject to} \quad & A_i \bullet X = b_i, \ i = 1, 2, \ldots, m, \quad X \succeq 0.
\end{aligned} \tag{56}$$

The notation $X \succeq 0$ means that $X$ is positive semidefinite, and $X \succ 0$ means that $X$
is positive definite. If a matrix $X \succ 0$ satisfies all equalities in (56), it is called a
(primal) strictly or interior feasible solution.

Note that in semidefinite programming we minimize a linear function of a
symmetric matrix constrained in the cone of positive semidefinite matrices and
subject to linear equality constraints.

We present several examples to illustrate the flexibility of this formulation.


Example 1 (Binary quadratic optimization). Consider a binary quadratic 
optimization problem 


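$$\begin{aligned}
\text{minimize} \quad & x^T Q x + 2 c^T x \\
\text{subject to} \quad & x_j \in \{1, -1\}, \ \text{for all } j = 1, \ldots, n,
\end{aligned}$$

which is a difficult nonconvex optimization problem. The problem can be rewritten
as

$$\begin{aligned}
z^* = \text{minimize} \quad & \begin{bmatrix} x \\ 1 \end{bmatrix}^T \begin{bmatrix} Q & c \\ c^T & 0 \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix} \\
\text{subject to} \quad & (x_j)^2 = 1, \ \text{for all } j = 1, \ldots, n,
\end{aligned}$$

which can be also written as

$$\begin{aligned}
z^* = \text{minimize} \quad & \begin{bmatrix} Q & c \\ c^T & 0 \end{bmatrix} \bullet \begin{bmatrix} x \\ 1 \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix}^T \\
\text{subject to} \quad & I_j \bullet \begin{bmatrix} x \\ 1 \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix}^T = 1, \ \text{for all } j = 1, \ldots, n,
\end{aligned}$$

where $I_j$ is the $(n+1) \times (n+1)$ matrix whose components are all zero except at
the $j$th position on the main diagonal where it is 1.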


Since iil forms a positive-semidefinite matrix (with rank equal to 1), a 


semidefinite relaxation of the problem is defined as 


fee Qe 
z5PP = minimize [oo eY 
subject toleY=1, for all j=1,...,.n+1, (57) 
Y>0, 
where the symmetric matrix Y has dimension n+ 1. Obviously, 2°?" is a lower 


bound of z*, since the rank-1 constraint is not enforced in the relaxation. 

For simplicity, assuming 7°?’ > 0, it has been shown that in many cases 
of this problem an optimal SDP solution either constitutes an exact solution or 
can be rounded to a good approximate solution of the original problem. In the 
former case, one can show that a rank-1 optimal solution matrix Y exists for the 
semidefinite relaxation and it can be found by using a rank-reduction procedure. 
For the latter case, one can, using a randomized rank-reduction procedure or the 
principle components of Y, find a rank-1 feasible solution matrix Y such that 


E [eta Z az" 
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for a provable factor a > 1. Thus, one can find a feasible solution to the original 
problem whose objective cost is no more than a factor a higher than the minimal 
objective cost. 
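
The lifting step in Example 1 can be checked by enumeration on a tiny instance. The sketch below, in plain Python, assumes illustrative data $Q$ and $c$ that are not from the text; it confirms that the quadratic objective equals the inner product of the bordered cost matrix with the rank-1 matrix built from $(x, 1)$, and that this rank-1 matrix has unit diagonal, so it is feasible for the relaxation. It does not solve the semidefinite relaxation itself.

```python
from itertools import product

def inner(A, B):
    """A . B = sum of a_ij * b_ij."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def bordered_cost(Q, c):
    """Build the (n+1) x (n+1) matrix [[Q, c], [c^T, 0]]."""
    return [Q[i] + [c[i]] for i in range(len(Q))] + [c + [0]]

def rank_one_lift(x):
    """Return the rank-1 matrix v v^T with v = (x, 1)."""
    v = list(x) + [1]
    return [[vi * vj for vj in v] for vi in v]

def check_lifting(Q, c):
    """Enumerate x in {-1, 1}^n; return (identity holds for all x, z*)."""
    n, M = len(c), bordered_cost(Q, c)
    all_match, best = True, None
    for x in product((-1, 1), repeat=n):
        direct = (sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
                  + 2 * sum(c[i] * x[i] for i in range(n)))
        Y = rank_one_lift(x)
        unit_diagonal = all(Y[j][j] == 1 for j in range(n + 1))
        all_match = all_match and unit_diagonal and direct == inner(M, Y)
        best = direct if best is None else min(best, direct)
    return all_match, best

Q = [[1, -2], [-2, 3]]   # illustrative symmetric matrix (assumed data)
c = [1, -1]              # illustrative linear term (assumed data)
print(check_lifting(Q, c))
```

Enumeration is exponential in $n$ and is used here only to make the identity concrete.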


Example 2 (Linear Programming). To see that the problem (SDP) (that is, (56)) generalizes linear programming, define $C = \operatorname{diag}[c_1, c_2, \ldots, c_n]$, and let $A_i = \operatorname{diag}[a_{i1}, a_{i2}, \ldots, a_{in}]$ for $i = 1, 2, \ldots, m$. The unknown is the $n \times n$ symmetric matrix $X$, which is constrained by $X \succeq 0$. Since $C \bullet X$ and each $A_i \bullet X$ depend only on the diagonal elements of $X$, we may restrict the solutions $X$ to diagonal matrices. It follows that in this case the problem can be recast as the linear program

\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & A x = b \\
& x \ge 0,
\end{array}
\tag{58}
\]

where $x$ is the vector of diagonal elements of $X$.


Example 3 (Sensor localization). This problem is that of determining the location of sensors (for example, several cell phones scattered in a building) when measurements of some of their separation distances can be determined, but their specific locations are not known. In general, suppose there are $n$ unknown points $x_j \in E^d$, $j = 1, \ldots, n$. We consider an edge to be a path between two points, say, $i$ and $j$. There is a known subset $N_e$ of pairs (edges) $ij$ for which the separation distance $d_{ij}$ is known. For example, this distance might be determined by the signal strength or delay time between the points. Typically, in the cell phone example, $N_e$ contains those edges whose lengths are small so that there is a strong radio signal. Then, the localization problem is to find locations $x_j$, $j = 1, \ldots, n$, such that

\[
|x_i - x_j|^2 = (d_{ij})^2, \quad \text{for all } (i, j) \in N_e,
\]

subject to possible rotation and translation. (If the locations of some of the sensors are known, these may be sufficient to determine the rotation and translation.)

Let $X = [x_1 \ x_2 \ \ldots \ x_n]$ be the $d \times n$ matrix to be determined. Then

\[
|x_i - x_j|^2 = (e_i - e_j)^T X^T X (e_i - e_j),
\]

where $e_i \in E^n$ is the vector with 1 at the $i$th position and zero everywhere else. Let $Y = X^T X$. Then the semidefinite relaxation of the localization problem is to find $Y$ such that

\[
\begin{array}{l}
(e_i - e_j)(e_i - e_j)^T \bullet Y = (d_{ij})^2, \quad \text{for all } (i, j) \in N_e, \\
Y \succeq 0.
\end{array}
\]

This problem is one of finding a feasible solution; the objective function is zero. For certain instances, factorization of $Y$ provides a unique localization $X$ to the original problem.
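
The identity behind this relaxation is easy to confirm numerically. The following sketch in plain Python uses three illustrative points in the plane (assumed data, not from the text) and checks that the constraint matrix $(e_i - e_j)(e_i - e_j)^T$, paired with the Gram matrix $Y = X^T X$, reproduces the squared separation distance.

```python
def inner(A, B):
    """A . B = sum of a_ij * b_ij."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def gram(points):
    """Y = X^T X, where the columns of X are the given points."""
    return [[sum(p * q for p, q in zip(a, b)) for b in points] for a in points]

def edge_matrix(i, j, n):
    """Return (e_i - e_j)(e_i - e_j)^T for distinct indices i and j."""
    e = [0] * n
    e[i], e[j] = 1, -1
    return [[e[r] * e[s] for s in range(n)] for r in range(n)]

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)]   # illustrative sensor positions
Y = gram(points)
for i, j in [(0, 1), (1, 2), (0, 2)]:
    lhs = inner(edge_matrix(i, j, len(points)), Y)
    print((i, j), lhs, squared_distance(points[i], points[j]))
```

Because $Y$ depends on the points only through inner products, it is unchanged by rotation of the configuration, which is consistent with the remark that the localization is determined only up to such motions unless some locations are known.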
Duality 


Because semidefinite programming is an extension of linear programming, it would 
seem that there is a natural dual to the primal problem, and that this dual is itself 
a semidefinite program. This is indeed the case, and it is related to the primal in 
much the same way as primal and dual linear programs are related. Furthermore, 
the primal and dual together lead to the formation of a primal-dual solution method, 
which is discussed later in this section. 

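The dual of the primal (SDP) is

\[
\text{(SDD)} \qquad
\begin{array}{ll}
\text{maximize} & y^T b \\
\text{subject to} & \sum_{i=1}^{m} y_i A_i + S = C, \\
& S \succeq 0.
\end{array}
\tag{59}
\]

As in much of linear programming, the vector of dual variables is often labeled $y$ rather than $\lambda$, and this convention is followed here. Notice that $S$ represents a slack matrix, and hence the problem can alternatively be expressed as

\[
\begin{array}{ll}
\text{maximize} & y^T b \\
\text{subject to} & \sum_{i=1}^{m} y_i A_i \preceq C.
\end{array}
\tag{60}
\]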


The duality is manifested by the relation between the optimal values of the 
primal and dual programs. The weak form of this relation is spelled out in the 
following lemma, the proof of which, like the weak form of other duality relations 
we have studied, is essentially an accounting issue. 


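Weak Duality in SDP. Let $X$ be feasible for (SDP) and $(y, S)$ feasible for (SDD). Then,

\[
C \bullet X \ge b^T y.
\]

Proof. By direct calculation

\[
C \bullet X - b^T y = \Big( \sum_{i=1}^{m} y_i A_i + S \Big) \bullet X - b^T y = \sum_{i=1}^{m} (A_i \bullet X) y_i + S \bullet X - b^T y = S \bullet X.
\]

Since both $X$ and $S$ are positive semidefinite, it follows that $S \bullet X \ge 0$. ∎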
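
The accounting in this proof can be traced on a small instance. The sketch below, in plain Python, builds a feasible primal-dual pair by construction: the matrices $A_i$, the primal point $X$, the multipliers $y$, and the slack $S$ are illustrative choices (not from the text), and $b$ and $C$ are then defined so that both sets of constraints hold exactly.

```python
def inner(A, B):
    """A . B = sum of a_ij * b_ij."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def combine(y, As, S):
    """Return sum_i y_i A_i + S."""
    n = len(S)
    return [[sum(yi * Ai[r][s] for yi, Ai in zip(y, As)) + S[r][s]
             for s in range(n)] for r in range(n)]

# Illustrative data (assumed for this sketch).
A1 = [[1, 0], [0, 0]]
A2 = [[0, 1], [1, 0]]
X = [[2, 1], [1, 1]]      # positive definite: trace 3, determinant 1
y = [1, -1]
S = [[1, 0], [0, 2]]      # positive definite diagonal slack

b = [inner(A1, X), inner(A2, X)]   # makes X primal feasible
C = combine(y, [A1, A2], S)        # makes (y, S) dual feasible

primal_value = inner(C, X)
dual_value = sum(bi * yi for bi, yi in zip(b, y))
gap = primal_value - dual_value
print(primal_value, dual_value, gap, inner(S, X))
```

The printed gap coincides with $S \bullet X$, as the proof states, and it is nonnegative because both matrices are positive semidefinite.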


Let us consider some examples of dual problems.

Example 4 (The dual of binary quadratic optimization). Consider the semidefinite relaxation (57) for the binary quadratic problem. Its dual is

\[
\begin{array}{ll}
\text{maximize} & \sum_{j=1}^{n+1} y_j \\
\text{subject to} & \sum_{j=1}^{n+1} y_j I_j + S = \begin{bmatrix} Q & c \\ c^T & 0 \end{bmatrix}, \\
& S \succeq 0.
\end{array}
\]

Note that

\[
\begin{bmatrix} Q & c \\ c^T & 0 \end{bmatrix} - \sum_{j=1}^{n+1} y_j I_j
\]

is the Hessian matrix of the Lagrange function of the quadratic problem; see Chapter 11.


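Example 5 (Dual linear program). The dual of the linear program (58) is

\[
\begin{array}{ll}
\text{maximize} & b^T y \\
\text{subject to} & A^T y \le c.
\end{array}
\]

It can be written as

\[
\begin{array}{ll}
\text{maximize} & b^T y \\
\text{subject to} & \operatorname{diag}(c - A^T y) \succeq 0,
\end{array}
\]

where as usual $\operatorname{diag}(c)$ denotes the diagonal matrix whose diagonal elements are the components of $c$.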


Example 6 (The dual of sensor localization). Consider the semidefinite programming relaxation for the sensor localization problem. Its dual, obtained from (59) with right-hand sides $(d_{ij})^2$, is

\[
\begin{array}{ll}
\text{maximize} & \sum_{(i,j) \in N_e} y_{ij} (d_{ij})^2 \\
\text{subject to} & \sum_{(i,j) \in N_e} y_{ij} (e_i - e_j)(e_i - e_j)^T + S = 0, \\
& S \succeq 0.
\end{array}
\]

Here, $y_{ij}$ represents an internal force or tension on edge $(i, j)$. Obviously, $y_{ij} = 0$ for all $(i, j) \in N_e$ is a feasible solution for the dual. However, finding non-trivial internal forces is a fundamental problem in network and structure design.


Example 7 (Quadratic constraints). Quadratic constraints can be transformed to linear semidefinite form by using the concept of Schur complements. To introduce this concept, consider the quadratic problem

\[
\text{minimize}_{x} \quad x^T A x + 2 y^T B^T x + y^T C y,
\]

where $A$ is positive definite and $C$ is symmetric. This has solution with respect to $x$ for fixed $y$ of

\[
x = -A^{-1} B y.
\]

The minimum value is then

\[
\begin{bmatrix} x \\ y \end{bmatrix}^T \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = y^T S y,
\]

where

\[
S = C - B^T A^{-1} B.
\]

The matrix $S$ is the Schur complement of $A$ in the matrix

\[
Z = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}.
\]

From this it follows that $Z$ is positive semidefinite if and only if $S$ is positive semidefinite (still assuming that $A$ is positive definite).
Now consider a general quadratic constraint of the form

\[
x^T B^T B x - c^T x - d \le 0. \tag{61}
\]

This is equivalent to

\[
\begin{bmatrix} I & B x \\ x^T B^T & c^T x + d \end{bmatrix} \succeq 0 \tag{62}
\]

because the Schur complement of this matrix with respect to $I$ is the negative of the left side of the original constraint (61). Note that in this larger matrix, the variable $x$ appears only affinely, not quadratically.

Indeed, (62) can be written as

\[
P(x) = P_0 + x_1 P_1 + x_2 P_2 + \cdots + x_n P_n \succeq 0, \tag{63}
\]

where

\[
P_0 = \begin{bmatrix} I & 0 \\ 0 & d \end{bmatrix}, \qquad P_i = \begin{bmatrix} 0 & b_i \\ b_i^T & c_i \end{bmatrix} \quad \text{for } i = 1, 2, \ldots, n,
\]

with $b_i$ being the $i$th column of $B$ and $c_i$ being the $i$th component of $c$. The constraint (63) is of the form that appears in the dual form of a semidefinite program.

Suppose the original optimization problem has a quadratic objective: minimize $q(x)$. The objective can be written instead as: minimize $t$ subject to $q(x) \le t$, and then this constraint as well as any number of other quadratic constraints can be transformed to semidefinite constraints, and hence the entire problem converted to a semidefinite program. This approach is useful in many applications, especially in various problems of control theory.
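
The equivalence between a quadratic constraint and its affine matrix form can be seen in the simplest case, with a single scalar variable. The sketch below, in plain Python, assumes $n = 1$ and a $1 \times 1$ matrix $B = [b]$, so that the matrix inequality is $2 \times 2$; the function names and the sample coefficients are illustrative only. It compares the constraint $b^2 x^2 - cx - d \le 0$ with positive semidefiniteness of $P_0 + x P_1$ over a range of integer values of $x$.

```python
def affine_matrix(x, b, c, d):
    """P(x) = P0 + x * P1 with P0 = [[1, 0], [0, d]] and P1 = [[0, b], [b, c]]."""
    P0 = [[1, 0], [0, d]]
    P1 = [[0, b], [b, c]]
    return [[P0[i][j] + x * P1[i][j] for j in range(2)] for i in range(2)]

def is_psd_2x2(M):
    """A symmetric 2x2 matrix is positive semidefinite exactly when both
    diagonal entries and the determinant are nonnegative."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return M[0][0] >= 0 and M[1][1] >= 0 and det >= 0

def quadratic_holds(x, b, c, d):
    """Check the scalar form of the quadratic constraint directly."""
    return (b * x) ** 2 - c * x - d <= 0

def forms_agree(b, c, d, xs):
    return all(is_psd_2x2(affine_matrix(x, b, c, d)) == quadratic_holds(x, b, c, d)
               for x in xs)

print(forms_agree(1, 0, 4, range(-5, 6)))   # constraint x^2 <= 4
print(forms_agree(2, 3, 1, range(-5, 6)))   # constraint 4x^2 - 3x - 1 <= 0
```

Note that $x$ enters `affine_matrix` only through a multiple of the fixed matrix `P1`, which is the affine dependence the text refers to.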


As in other instances of duality, the duality of semidefinite programs is weak 
unless other conditions hold. We state here, but do not prove, a version of the strong 
duality theorem. 


Strong Duality in SDP. Suppose (SDP) and (SDD) are both feasible and at 
least one of them has an interior. Then, there are optimal solutions to the 
primal and the dual and their optimal values are equal. 


If the non-empty interior condition of the above theorem does not hold, then 
the duality gap may not be zero at optimality. 


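Example 8 The following semidefinite program has a duality gap:

\[
C = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad
A_1 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad
A_2 = \begin{bmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 2 \end{bmatrix}
\]

and

\[
b = \begin{bmatrix} 0 \\ 10 \end{bmatrix}.
\]

The primal minimal objective value is 0 achieved by

\[
X = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 5 \end{bmatrix}
\]

and the dual maximal objective value is $-10$ achieved by $y = [0, -1]$; so the duality gap is 10.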
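
The feasibility claims in this example can be verified by direct substitution. The following sketch in plain Python takes $b = (0, 10)$, the right-hand side consistent with the stated primal solution, and checks the primal constraints, forms the dual slack $S = C - y_1 A_1 - y_2 A_2$, and evaluates both objectives. It verifies feasibility and the two objective values only; it does not prove that these values are optimal.

```python
def inner(A, B):
    """A . B = sum of a_ij * b_ij."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def dual_slack(C, y, As):
    """S = C - sum_i y_i A_i."""
    n = len(C)
    return [[C[r][s] - sum(yi * Ai[r][s] for yi, Ai in zip(y, As))
             for s in range(n)] for r in range(n)]

def is_diagonal_psd(M):
    """Sufficient test used here: M is diagonal with nonnegative entries."""
    n = len(M)
    off_diagonal_zero = all(M[r][s] == 0 for r in range(n) for s in range(n) if r != s)
    return off_diagonal_zero and all(M[r][r] >= 0 for r in range(n))

C = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
A1 = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
A2 = [[0, -1, 0], [-1, 0, 0], [0, 0, 2]]
b = [0, 10]
X = [[0, 0, 0], [0, 0, 0], [0, 0, 5]]
y = [0, -1]

primal_residuals = [inner(A1, X) - b[0], inner(A2, X) - b[1]]
primal_value = inner(C, X)
S = dual_slack(C, y, [A1, A2])
dual_value = sum(bi * yi for bi, yi in zip(b, y))
print(primal_residuals, primal_value, S, dual_value)
```

The gap arises because $A_1 \bullet X = 0$ forces the second diagonal entry of $X$ to vanish, so no positive definite $X$ is feasible, while dual feasibility forces $y_2 = -1$ and leaves the slack singular; neither problem has an interior, so the strong duality theorem does not apply.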


Interior-Point Algorithms for SDP 


Let the primal (SDP) and dual (SDD) semidefinite programs both have interior point feasible solutions. Then, the central path can be expressed as

\[
\mathcal{C} = \left\{ (X, y, S) \in \mathring{\mathcal{F}} : XS = \mu I, \ 0 < \mu < \infty \right\}.
\]

The primal-dual potential function for SDP, a descent merit function, is

\[
\psi_{n+\rho}(X, S) = (n + \rho) \log(X \bullet S) - \log(\det(X) \cdot \det(S)),
\]

where $\rho \ge 0$. Note that if $X$ and $S$ are diagonal matrices, these definitions reduce to those for linear programming.

Once we have an interior feasible point $(X, y, S)$, we can generate a new iterate $(X^+, y^+, S^+)$ by solving for $(D_X, d_y, D_S)$ from the primal-dual system of linear equations

\[
\begin{array}{rcl}
D^{-1} D_X D^{-1} + D_S & = & \gamma \mu X^{-1} - S, \\
A_i \bullet D_X & = & 0, \quad \text{for all } i, \\
-\sum_{i=1}^{m} (d_y)_i A_i - D_S & = & 0,
\end{array}
\tag{64}
\]

where $D$ is the (scaling) matrix

\[
D = X^{1/2} \left( X^{1/2} S X^{1/2} \right)^{-1/2} X^{1/2}
\]

and $\mu = X \bullet S / n$. Then one assigns $X^+ = X + \bar{\alpha} D_X$, $y^+ = y + \bar{\alpha} d_y$, and $S^+ = S + \bar{\alpha} D_S$, where

\[
\bar{\alpha} = \arg\min_{\alpha \ge 0} \psi_{n+\rho}(X + \alpha D_X, S + \alpha D_S).
\]

Furthermore, it can be shown that

\[
\psi_{n+\rho}(X^+, S^+) - \psi_{n+\rho}(X, S) \le -\delta
\]

for a constant $\delta > 0.2$. This provides an iteration complexity bound that is identical to that for linear programming, as discussed in Chapter 5.
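
For diagonal $X$ and $S$ the potential function is simple to evaluate, since $X \bullet S$ becomes a vector inner product and each determinant becomes a product of diagonal entries. The sketch below, in plain Python, covers only this diagonal (linear programming) case; the function name and sample points are illustrative. With $\rho = 0$ it also illustrates the lower bound $n \log n$, which is attained when all products $x_j s_j$ are equal, that is, on the central path.

```python
import math

def potential_diagonal(x, s, rho):
    """(n + rho) * log(X . S) - log(det X * det S) for X = diag(x), S = diag(s),
    where all entries of x and s are positive."""
    n = len(x)
    x_dot_s = sum(xi * si for xi, si in zip(x, s))             # X . S
    log_det = sum(math.log(xi) + math.log(si) for xi, si in zip(x, s))
    return (n + rho) * math.log(x_dot_s) - log_det

n = 3
centered = potential_diagonal([1.0, 2.0, 4.0], [4.0, 2.0, 1.0], rho=0.0)  # x_j s_j = 4 for all j
off_center = potential_diagonal([1.0, 1.0, 1.0], [1.0, 2.0, 3.0], rho=0.0)
print(centered, off_center, n * math.log(n))
```

Taking $\rho > 0$ adds the term $\rho \log(X \bullet S)$, so that driving the potential toward $-\infty$ forces the duality gap $X \bullet S$ toward zero.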


15.10 SUMMARY 


A constrained optimization problem can be solved by directly solving the equations 
that represent the first-order necessary conditions for a solution. For a quadratic 
programming problem with linear constraints, these equations are linear and thus 
can be solved by standard linear procedures. Quadratic programs with inequality 
constraints can be solved by an active set method in which the direction of movement 
is toward the solution of the corresponding equality constrained problem. This 
method will solve a quadratic program in a finite number of steps. 

For general nonlinear programming problems, many of the standard methods 
for solving systems of equations can be adapted to the corresponding necessary 
equations. One class consists of first-order methods that move in a direction related 
to the residual (that is, the error) in the equations. Another class of methods is based 
on extending the method of conjugate directions to nonpositive-definite systems. 
Finally, a third class is based on Newton’s method for solving systems of nonlinear equations, and solving a linearized version of the system at each iteration. Under appropriate assumptions, Newton’s method has excellent global as well as local convergence properties, since the simple merit function, $\frac{1}{2}|\nabla f(x) + \lambda^T \nabla h(x)|^2 + \frac{1}{2}|h(x)|^2$, decreases in the Newton direction. An individual step of Newton’s method is equivalent to solving a quadratic programming problem, and thus Newton’s method can be extended to problems with inequality constraints through recursive quadratic programming. 

More effective methods are developed by accounting for the special structure of 
the linearized version of the necessary conditions and by introducing approximations 
to the second-order information. In order to assure global convergence of these 
methods, a penalty (or merit) function must be specified that is compatible with 
the method of direction selection, in the sense that the direction is a direction of 
descent for the merit function. The absolute-value penalty function and the standard 
quadratic penalty function are both compatible with some versions of recursive 
quadratic programming. 

The best of the primal-dual methods take full account of special structure, 
and are based on direction-finding procedures that are closely related to methods 
described in earlier chapters. It is not surprising therefore that the convergence 
properties of these methods are also closely related to those of other chapters. Again 
we find that the canonical rate is fundamental for properly designed first-order 
methods. 

Interior point methods in the primal-dual mode are very effective for treating 
problems with inequality constraints, for they avoid (or at least minimize) the diffi- 
culties associated with determining which constraints will be active at the solution. 
Applied to general nonlinear programming problems, these methods closely parallel 
the interior point methods for linear programming. There is again a central path, 
and Newton’s method is a good way to follow the path. 

A relatively new class of mathematical programming problems is semidefinite 
programming, where the unknown is a matrix and at least some of the constraints 
require the unknown matrix to be positive semidefinite (or negative semidefinite). 
There is a variety of interesting and important practical problems that can be 
naturally cast in this form. Because many problems which appear nonlinear (such 
as quadratic problems) become essentially linear in semidefinite form, the efficient 
interior point algorithms for linear programming can be extended to these problems 
as well. 

1. Solve the quadratic program

\[
\begin{array}{ll}
\text{minimize} & x^2 - xy + y^2 - 3x \\
\text{subject to} & x \ge 0 \\
& y \ge 0 \\
& x + y \le 4
\end{array}
\]

by use of the active set method starting at $x = y = 0$.

2. Suppose $x^*$, $\lambda^*$ satisfy

\[
\begin{array}{l}
\nabla f(x^*) + \lambda^{*T} \nabla h(x^*) = 0 \\
h(x^*) = 0.
\end{array}
\]

Let

\[
C = \begin{bmatrix} L(x^*, \lambda^*) & \nabla h(x^*)^T \\ -\nabla h(x^*) & 0 \end{bmatrix}.
\]

Assume that $L(x^*, \lambda^*)$ is positive definite and that $\nabla h(x^*)$ is of full rank.
a) Show that the real part of each eigenvalue of $C$ is positive.
b) Using the result of Part (a), show that for some $\alpha > 0$ the iterative process

\[
\begin{array}{l}
x_{k+1} = x_k - \alpha \nabla l(x_k, \lambda_k)^T \\
\lambda_{k+1} = \lambda_k + \alpha h(x_k)
\end{array}
\]

converges locally to $x^*$, $\lambda^*$. (That is, if started sufficiently close to $x^*$, $\lambda^*$, the process converges to $x^*$, $\lambda^*$.) Hint: Use Ostrowski’s Theorem: Let $A(z)$ be a continuously differentiable mapping from $E^p$ to $E^p$, assume $A(z^*) = 0$, and let $I + \nabla A(z^*)$ have all eigenvalues strictly inside the unit circle of the complex plane. Then $z_{k+1} = z_k + A(z_k)$ converges locally to $z^*$.


3. Let $A$ be a real symmetric matrix. A vector $x$ is singular if $x^T A x = 0$. A pair of vectors $x$, $y$ is a hyperbolic pair if both $x$ and $y$ are singular and $x^T A y \ne 0$. Hyperbolic pairs can be used to generalize the conjugate gradient method to the nonpositive definite case.

a) If $p_k$ is singular, show that if $p_{k+1}$ is defined as

\[
p_{k+1} = A p_k - \frac{(A p_k)^T A^2 p_k}{2 |A p_k|^2} \, p_k,
\]

then $p_k$, $p_{k+1}$ is a hyperbolic pair.
b) Consider a modification of the conjugate gradient process of Section 8.3, where if $p_k$ is singular, $p_{k+1}$ is generated as above, and then

\[
\begin{array}{l}
x_{k+1} = x_k + \alpha_k p_k \\[4pt]
x_{k+2} = x_{k+1} + \alpha_{k+1} p_{k+1} \\[4pt]
\alpha_k = \dfrac{r_k^T p_{k+1}}{p_k^T A p_{k+1}}, \qquad \alpha_{k+1} = \dfrac{r_k^T p_k}{p_k^T A p_{k+1}} \\[8pt]
p_{k+2} = r_{k+2} - \dfrac{r_{k+2}^T A p_{k+1}}{p_k^T A p_{k+1}} \, p_k.
\end{array}
\]

Show that if $p_{k+1}$ is the second member of a hyperbolic pair and $r_k \ne 0$, then $x_{k+2} \ne x_{k+1}$, which means the process does not get “stuck.”

4. Another method for solving a system $Ax = b$ when $A$ is nonsingular and symmetric is the conjugate residual method. In this method the direction vectors are constructed to be an $A^2$-orthogonalized version of the residuals $r_k = b - A x_k$. The error function $E(x) = |Ax - b|^2$ decreases monotonically in this process. Since the directions are based on $r_k$ rather than the gradient of $E$, which is $2 A r_k$, the method extends the simplicity of the conjugate gradient method by implicit use of the fact that $A^2$ is positive definite. The method is this: Set $p_0 = r_0 = b - A x_0$ and repeat the following steps, omitting (a, b) on the first step.

If $\alpha_{k-1} \ne 0$,

\[
p_k = r_k - \beta_k p_{k-1}, \qquad \beta_k = \frac{r_k^T A^2 p_{k-1}}{p_{k-1}^T A^2 p_{k-1}}. \tag{65a}
\]

If $\alpha_{k-1} = 0$,

\[
p_k = A r_k - \gamma_k p_{k-1} - \delta_k p_{k-2}, \qquad
\gamma_k = \frac{r_k^T A^3 p_{k-1}}{p_{k-1}^T A^2 p_{k-1}}, \qquad
\delta_k = \frac{r_k^T A^3 p_{k-2}}{p_{k-2}^T A^2 p_{k-2}}. \tag{65b}
\]

\[
x_{k+1} = x_k + \alpha_k p_k, \qquad \alpha_k = \frac{r_k^T A p_k}{p_k^T A^2 p_k}. \tag{65c}
\]

\[
r_{k+1} = b - A x_{k+1}. \tag{65d}
\]

Show that the directions $p_k$ are $A^2$-orthogonal.
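
To make steps (65a) to (65d) concrete, the following is a minimal sketch of the conjugate residual iteration in plain Python. It is an illustration under stated assumptions rather than a reference implementation: the test matrix is an illustrative symmetric indefinite $2 \times 2$ example, the test for $\alpha_{k-1} = 0$ is an exact floating-point comparison, and when $p_{k-2}$ does not yet exist the $\delta_k$ term in (65b) is simply omitted. Because $A$ is symmetric, $r^T A^2 p = (Ar)^T (Ap)$ and $p^T A^2 p = |Ap|^2$, which is how the coefficients are evaluated.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_residual(A, b, x0, tol=1e-24):
    """Run at most n steps of (65a)-(65d) for symmetric nonsingular A."""
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]        # r_0 = b - A x_0
    p_prev = p_prev2 = alpha_prev = None                    # p_{k-1}, p_{k-2}, alpha_{k-1}
    for _ in range(len(b)):
        if dot(r, r) <= tol:                                # residual already negligible
            break
        Ar = matvec(A, r)
        if p_prev is None:                                  # first step: p_0 = r_0
            p = list(r)
        elif alpha_prev != 0.0:                             # (65a)
            Ap1 = matvec(A, p_prev)
            beta = dot(Ar, Ap1) / dot(Ap1, Ap1)
            p = [ri - beta * pi for ri, pi in zip(r, p_prev)]
        else:                                               # (65b)
            Ap1 = matvec(A, p_prev)
            gamma = dot(Ar, matvec(A, Ap1)) / dot(Ap1, Ap1)
            p = [ari - gamma * pi for ari, pi in zip(Ar, p_prev)]
            if p_prev2 is not None:
                Ap2 = matvec(A, p_prev2)
                delta = dot(Ar, matvec(A, Ap2)) / dot(Ap2, Ap2)
                p = [pi - delta * qi for pi, qi in zip(p, p_prev2)]
        Ap = matvec(A, p)
        alpha = dot(r, Ap) / dot(Ap, Ap)                    # (65c)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [bi - ai for bi, ai in zip(b, matvec(A, x))]    # (65d)
        p_prev2, p_prev, alpha_prev = p_prev, p, alpha
    return x

A = [[2.0, 1.0], [1.0, -3.0]]     # symmetric, indefinite, nonsingular (illustrative)
b = [1.0, 2.0]
x = conjugate_residual(A, b, [0.0, 0.0])
print(x)                          # close to (5/7, -3/7)
```

On this example both step sizes are nonzero, so only branch (65a) is exercised; the branch for $\alpha_{k-1} = 0$ follows (65b) term by term but is not tested here.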


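5. Consider the $(n + m)$-dimensional system of equations

\[
\begin{bmatrix} L & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ \lambda \end{bmatrix} = \begin{bmatrix} a \\ b \end{bmatrix}.
\]

Suppose that $A = [B, C]$, where $B$ is $m \times m$ and invertible. Let $x = (x_B, x_C)$, where $x_B$ is the first $m$ components of $x$. The system can then be written

\[
\begin{bmatrix} L_{BB} & L_{BC} & B^T \\ L_{CB} & L_{CC} & C^T \\ B & C & 0 \end{bmatrix}
\begin{bmatrix} x_B \\ x_C \\ \lambda \end{bmatrix} =
\begin{bmatrix} a_B \\ a_C \\ b \end{bmatrix}.
\]

a) Assume that $L$ is positive definite on the tangent space $\{x : Ax = 0\}$. Derive an explicit statement equivalent to this assumption in terms of the positive definiteness of some $(n - m) \times (n - m)$ matrix.

b) Solve the system in terms of the submatrices of the partitioned form.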


6. Consider the partitioned square matrix $M$ of the form

\[
M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}.
\]

Show that

\[
M^{-1} = \begin{bmatrix} Q & -Q B D^{-1} \\ -D^{-1} C Q & D^{-1} + D^{-1} C Q B D^{-1} \end{bmatrix},
\]

where $Q = (A - B D^{-1} C)^{-1}$, provided that all indicated inverses exist. Use this result to verify the rate of convergence result in Section 15.7.


7. For the problem

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & g(x) \le 0,
\end{array}
\]

where $g(x)$ is $r$-dimensional, define the penalty function

\[
P(x) = f(x) + c \max\{0, g_1(x), g_2(x), \ldots, g_r(x)\}.
\]

Let $d$ ($d \ne 0$) be a solution to the quadratic program

\[
\begin{array}{ll}
\text{minimize} & \frac{1}{2} d^T B d + \nabla f(x) d \\
\text{subject to} & g(x) + \nabla g(x) d \le 0,
\end{array}
\]

where $B$ is positive definite. Show that $d$ is a descent direction for $P$ for sufficiently large $c$.

8. Suppose the quadratic program of Exercise 7 is not feasible. In that case one may solve

\[
\begin{array}{ll}
\text{minimize} & \frac{1}{2} d^T B d + \nabla f(x) d + c \xi \\
\text{subject to} & g(x) + \nabla g(x) d \le \xi \mathbf{1} \\
& \xi \ge 0.
\end{array}
\]

a) Show that if $d \ne 0$ is a solution, then $d$ is a descent direction for $P$.
b) If $d = 0$ is a solution, show that $x$ is a critical point of $P$ in the sense that for any $d \ne 0$, $P(x + \alpha d) \ge P(x) + o(\alpha)$.


9. For the equality constrained problem, consider the function

\[
\phi(x) = f(x) + \lambda(x)^T h(x) + c \, h(x)^T C(x) C(x)^T h(x),
\]

where

\[
C(x) = [\nabla h(x) \nabla h(x)^T]^{-1} \nabla h(x) \quad \text{and} \quad \lambda(x) = C(x) \nabla f(x)^T.
\]

a) Under standard assumptions on the original problem, show that for sufficiently large $c$, $\phi$ is (locally) an exact penalty function.
b) Show that $\phi(x)$ can be expressed as

\[
\phi(x) = f(x) + \pi(x)^T h(x),
\]

where $\pi(x)$ is the Lagrange multiplier of the problem

\[
\begin{array}{ll}
\text{minimize} & \frac{1}{2} c \, d^T d + \nabla f(x) d \\
\text{subject to} & \nabla h(x) d + h(x) = 0.
\end{array}
\]

c) Indicate how $\phi$ can be defined for problems with inequality constraints.


10. Let $\{B_k\}$ be a sequence of positive definite symmetric matrices, and assume that there are constants $a > 0$, $b > 0$ such that $a|x|^2 \le x^T B_k x \le b|x|^2$ for all $x$. Suppose that $B$ is replaced by $B_k$ in the $k$th step of the recursive quadratic programming procedure of the theorem in Section 15.5. Show that the conclusions of that theorem are still valid. Hint: Note that the set of allowable $B_k$’s is closed.


11. (Central path theorem) Prove the central path theorem, Theorem 1 of Section 15.8, for convex optimization.


12. Prove the potential reduction theorem, Theorem 2 of Section 15.8, for convex quadratic programming. This theorem can be generalized to non-quadratic convex objective functions $f(x)$ satisfying the following condition: let

\[
v : (0, 1) \to (1, \infty)
\]

be a monotone increasing function; then

\[
|X(\nabla f(x + d_x) - \nabla f(x) - \nabla^2 f(x) d_x)|_1 \le v(\alpha) \, d_x^T \nabla^2 f(x) d_x
\]

whenever

\[
x > 0, \quad |X^{-1} d_x|_\infty \le \alpha < 1.
\]

Such a condition is called the scaled Lipschitz condition in $\{x : x > 0\}$.

13. Let $A$ and $B$ be two symmetric and positive semidefinite matrices. Prove that

\[
A \bullet B \ge 0.
\]

14. (Farkas’ lemma in SDP) Let $A_i$, $i = 1, \ldots, m$, have rank $m$ (that is, $\sum_{i=1}^{m} y_i A_i = 0$ implies $y = 0$). Then, there exists a symmetric matrix $X \succ 0$ with

\[
A_i \bullet X = b_i, \quad i = 1, \ldots, m,
\]

if and only if $\sum_{i=1}^{m} y_i A_i \preceq 0$ and $\sum_{i=1}^{m} y_i A_i \ne 0$ imply $b^T y < 0$.

15. Let $X$ and $S$ both be positive definite. Prove that

\[
n \log(X \bullet S) - \log(\det(X) \cdot \det(S)) \ge n \log n.
\]




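16. Consider an SDP and the potential level set

\[
\Psi(\delta) = \{ (X, y, S) \in \mathring{\mathcal{F}} : \psi_{n+\rho}(X, S) \le \delta \}.
\]

Prove that

\[
\Psi(\delta^1) \subset \Psi(\delta^2) \quad \text{if } \delta^1 \le \delta^2,
\]

and for every $\delta$, $\Psi(\delta)$ is bounded and its closure $\bar{\Psi}(\delta)$ has non-empty intersection with the SDP solution set.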


17. Let both (SDP) and (SDD) have interior feasible points. Then for any $0 < \mu < \infty$, the central path point $(X(\mu), y(\mu), S(\mu))$ exists and is unique. Moreover,

i) the central path point $(X(\mu), y(\mu), S(\mu))$ is bounded where $0 < \mu \le \mu^0$ for any given $0 < \mu^0 < \infty$.
ii) For $0 < \mu' < \mu$,

\[
C \bullet X(\mu') < C \bullet X(\mu) \quad \text{and} \quad b^T y(\mu') > b^T y(\mu)
\]

if $X(\mu) \ne X(\mu')$ and $y(\mu) \ne y(\mu')$.

iii) $(X(\mu), y(\mu), S(\mu))$ converges to an optimal solution pair for (SDP) and (SDD), and the rank of the limit of $X(\mu)$ is maximal among all optimal solutions of (SDP) and the rank of the limit of $S(\mu)$ is maximal among all optimal solutions of (SDD).

Appendix A MATHEMATICAL 
REVIEW 


The purpose of this appendix is to set down for reference and review some basic 
definitions, notation, and relations that are used frequently in the text. 


A.1 SETS 


If $x$ is a member of the set $S$, we write $x \in S$. We write $y \notin S$ if $y$ is not a member of $S$.

A set $S$ may be specified by listing its elements between braces; such as, for example, $S = \{1, 2, 3, 4\}$. Alternatively, a set can be specified in the form $S = \{x : P(x)\}$ as the set of elements satisfying property $P$; such as $S = \{x : 1 \le x \le 4, \ x \text{ integer}\}$.

The union of two sets $S$ and $T$ is denoted $S \cup T$ and is the set consisting of the elements that belong to either $S$ or $T$. The intersection of two sets $S$ and $T$ is denoted $S \cap T$ and is the set consisting of the elements that belong to both $S$ and $T$. If $S$ is a subset of $T$, that is, if every member of $S$ is also a member of $T$, we write $S \subset T$ or $T \supset S$.

The empty set is denoted $\phi$ or $\emptyset$. There are two ways that operations such as minimization over a set are represented. Specifically we write either

\[
\min_{x \in S} f(x) \quad \text{or} \quad \min \{ f(x) : x \in S \}
\]

to denote the minimum value of $f$ over the set $S$. The set of $x$’s in $S$ that achieve the minimum is denoted $\operatorname{argmin} \{ f(x) : x \in S \}$.


Sets of Real Numbers 


If $a$ and $b$ are real numbers, $[a, b]$ denotes the set of real numbers $x$ satisfying $a \le x \le b$. A rounded, instead of square, bracket denotes strict inequality in the definition. Thus $(a, b]$ denotes all $x$ satisfying $a < x \le b$.

If $S$ is a set of real numbers bounded above, then there is a smallest real number $y$ such that $x \le y$ for all $x \in S$. The number $y$ is called the least upper bound or supremum of $S$ and is denoted

\[
\sup_{x \in S} (x) \quad \text{or} \quad \sup \{ x : x \in S \}.
\]

Similarly, the greatest lower bound or infimum of a set $S$ is denoted

\[
\inf_{x \in S} (x) \quad \text{or} \quad \inf \{ x : x \in S \}.
\]


A.2 MATRIX NOTATION 


A matrix is a rectangular array of numbers, called elements. The matrix itself is denoted by a boldface letter. When specific numbers are not used, the elements are denoted by italicized lower-case letters, having a double subscript. Thus we write

\[
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}
\]

for a matrix $A$ having $m$ rows and $n$ columns. Such a matrix is referred to as an $m \times n$ matrix. If we wish to specify a matrix by defining a general element, we use the notation $A = [a_{ij}]$.

An $m \times n$ matrix all of whose elements are zero is called a zero matrix and denoted $0$. A square matrix (a matrix with $m = n$) whose elements $a_{ij} = 0$ for $i \ne j$, and $a_{ii} = 1$ for $i = 1, 2, \ldots, n$ is said to be an identity matrix and denoted $I$.

The sum of two $m \times n$ matrices $A$ and $B$ is written $A + B$ and is the matrix whose elements are the sum of the corresponding elements in $A$ and $B$. The product of a matrix $A$ and a scalar $\lambda$, written $\lambda A$ or $A \lambda$, is obtained by multiplying each element of $A$ by $\lambda$. The product $AB$ of an $m \times n$ matrix $A$ and an $n \times p$ matrix $B$ is the $m \times p$ matrix $C$ with elements $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$.

The transpose of an $m \times n$ matrix $A$ is the $n \times m$ matrix $A^T$ with elements $a^T_{ij} = a_{ji}$. A (square) matrix $A$ is symmetric if $A^T = A$. A square matrix $A$ is nonsingular if there is a matrix $A^{-1}$, called the inverse of $A$, such that $A^{-1} A = I = A A^{-1}$. The determinant of a square matrix $A$ is denoted by $\det(A)$. The determinant is nonzero if and only if the matrix is nonsingular. Two square $n \times n$ matrices $A$ and $B$ are similar if there is a nonsingular matrix $S$ such that $B = S^{-1} A S$.

Matrices having a single row are referred to as row vectors; matrices having a single column are referred to as column vectors. Vectors of either type are usually denoted by lower-case boldface letters. To economize page space, row vectors are written $a = [a_1, a_2, \ldots, a_n]$ and column vectors are written $a = (a_1, a_2, \ldots, a_n)$. Since column vectors are used frequently, this notation avoids the necessity to display numerous columns. To further distinguish rows from columns, we write $a \in E^n$ if $a$ is a column vector with $n$ components, and we write $b \in E_n$ if $b$ is a row vector with $n$ components.

It is often convenient to partition a matrix into submatrices. This is indicated by drawing partitioning lines through the matrix, as for example,

\[
A = \left[ \begin{array}{cc|cc}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
\hline
a_{31} & a_{32} & a_{33} & a_{34}
\end{array} \right]
= \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}.
\]

The resulting submatrices are usually denoted $A_{ij}$, as illustrated.

A matrix can be partitioned into either column or row vectors, in which case a special notation is convenient. Denoting the columns of an $m \times n$ matrix $A$ by $a_j$, $j = 1, 2, \ldots, n$, we write $A = [a_1, a_2, \ldots, a_n]$. Similarly, denoting the rows of $A$ by $a^i$, $i = 1, 2, \ldots, m$, we write $A = (a^1, a^2, \ldots, a^m)$. Following the same pattern, we often write $A = [B, C]$ for the partitioned matrix $A = [B \,|\, C]$.


A.3 SPACES 


We consider the $n$-component vectors $x = (x_1, x_2, \ldots, x_n)$ as elements of a vector space. The space itself, $n$-dimensional Euclidean space, is denoted $E^n$. Vectors in the space can be added or multiplied by a scalar, by performing the corresponding operations on the components. We write $x \ge 0$ if each component of $x$ is nonnegative.

The line segment connecting two vectors $x$ and $y$ is denoted $[x, y]$ and consists of all vectors of the form $\alpha x + (1 - \alpha) y$ with $0 \le \alpha \le 1$.

The scalar product of two vectors $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ is defined as $x^T y = y^T x = \sum_{i=1}^{n} x_i y_i$. The vectors $x$ and $y$ are said to be orthogonal if $x^T y = 0$. The magnitude or norm of a vector $x$ is $|x| = (x^T x)^{1/2}$. For any two vectors $x$ and $y$ in $E^n$, the Cauchy-Schwarz Inequality holds: $|x^T y| \le |x| \cdot |y|$.

A set of vectors $a_1, a_2, \ldots, a_k$ is said to be linearly dependent if there are scalars $\lambda_1, \lambda_2, \ldots, \lambda_k$, not all zero, such that $\sum_{i=1}^{k} \lambda_i a_i = 0$. If no such set of scalars exists, the vectors are said to be linearly independent. A linear combination of the vectors $a_1, a_2, \ldots, a_k$ is a vector of the form $\sum_{i=1}^{k} \lambda_i a_i$. The set of vectors that are linear combinations of $a_1, a_2, \ldots, a_k$ is the set spanned by the vectors. A linearly independent set of vectors that span $E^n$ is said to be a basis for $E^n$. Every basis for $E^n$ contains exactly $n$ vectors.

The rank of a matrix A is equal to the maximum number of linearly independent 
columns in A. This number is also equal to the maximum number of linearly 
independent rows in A. The m x n matrix A is said to be of full rank if the rank of 
A is equal to the minimum of m and n. 

A subspace $M$ of $E^n$ is a subset that is closed under the operations of vector addition and scalar multiplication; that is, if $a$ and $b$ are vectors in $M$, then $\lambda a + \mu b$ is also in $M$ for every pair of scalars $\lambda$, $\mu$. The dimension of a subspace $M$ is equal to the maximum number of linearly independent vectors in $M$. If $M$ is a subspace of $E^n$, the orthogonal complement of $M$, denoted $M^\perp$, consists of all vectors that are orthogonal to every vector in $M$. The orthogonal complement of $M$ is easily seen to be a subspace, and together $M$ and $M^\perp$ span $E^n$ in the sense that every vector $x \in E^n$ can be written uniquely in the form $x = a + b$ with $a \in M$, $b \in M^\perp$. In this case $a$ and $b$ are said to be the orthogonal projections of $x$ onto the subspaces $M$ and $M^\perp$, respectively.

A correspondence $A$ that associates with each point in a space $X$ a point in a space $Y$ is said to be a mapping from $X$ to $Y$. For convenience this situation is symbolized by $A : X \to Y$. The mapping $A$ may be either linear or nonlinear. The norm of a linear mapping $A$ is defined as $|A| = \max_{|x| \le 1} |Ax|$. It follows that for any $x$, $|Ax| \le |A| \cdot |x|$.


A.4 EIGENVALUES AND QUADRATIC FORMS


Corresponding to an $n \times n$ square matrix $A$, a scalar $\lambda$ and a nonzero vector $x$ satisfying the equation $Ax = \lambda x$ are said to be, respectively, an eigenvalue and eigenvector of $A$. In order that $\lambda$ be an eigenvalue it is clear that it is necessary and sufficient for $A - \lambda I$ to be singular, and hence $\det(A - \lambda I) = 0$. This last result, when expanded, yields an $n$th-order polynomial equation which can be solved for $n$ (possibly nondistinct) complex roots $\lambda$ which are the eigenvalues of $A$.

Now, for the remainder of this section, assume that $A$ is symmetric. Then the following properties hold:


i) The eigenvalues of $A$ are real.
ii) Eigenvectors associated with distinct eigenvalues are orthogonal.
iii) There is an orthogonal basis for $E^n$, each element of which is an eigenvector of $A$.


If the basis $u_1, u_2, \ldots, u_n$ in (iii) is normalized so that each element has magnitude unity, then defining the matrix $Q = [u_1, u_2, \ldots, u_n]$ we note that $Q^T Q = I$ and hence $Q^T = Q^{-1}$. A matrix with this property is said to be an orthogonal matrix. Also, we observe, in this case, that

\[
\begin{array}{rcl}
Q^{-1} A Q = Q^T A Q & = & Q^T [A u_1, A u_2, \ldots, A u_n] \\
& = & Q^T [\lambda_1 u_1, \lambda_2 u_2, \ldots, \lambda_n u_n].
\end{array}
\]

Thus

\[
Q^{-1} A Q = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix}
\]

and therefore $A$ is similar to a diagonal matrix.

A symmetric matrix $A$ is said to be positive definite if the quadratic form $x^T A x$ is positive for all nonzero vectors $x$. Similarly, we define positive semidefinite, negative definite, and negative semidefinite if $x^T A x \ge 0$, $< 0$, or $\le 0$ for all $x$. The matrix $A$ is indefinite if $x^T A x$ is positive for some $x$ and negative for others.

It is easy to obtain a connection between definiteness and the eigenvalues of $A$. For any $x$ let $y = Q^{-1} x$ where $Q$ is defined as above. Then $x^T A x = y^T Q^T A Q y = \sum_{i=1}^{n} \lambda_i y_i^2$. Since the $y_i$’s are arbitrary (since $x$ is), it is clear that $A$ is positive definite (or positive semidefinite) if and only if all eigenvalues of $A$ are positive (or nonnegative).
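
The eigenvalue test for definiteness is easy to apply when the eigenvalues are available in closed form. The sketch below, in plain Python, handles only the symmetric $2 \times 2$ case $\begin{bmatrix} a & b \\ b & d \end{bmatrix}$, whose eigenvalues are $\frac{a+d}{2} \pm \sqrt{\left(\frac{a-d}{2}\right)^2 + b^2}$; the function names and the tolerance used to decide when an eigenvalue counts as zero are illustrative assumptions.

```python
import math

def eigenvalues_sym_2x2(a, b, d):
    """Eigenvalues (smaller, larger) of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return mean - radius, mean + radius

def classify(a, b, d, tol=1e-12):
    """Classify [[a, b], [b, d]] by the signs of its eigenvalues."""
    lo, hi = eigenvalues_sym_2x2(a, b, d)
    if lo > tol:
        return "positive definite"
    if lo >= -tol:
        return "positive semidefinite"
    if hi < -tol:
        return "negative definite"
    if hi <= tol:
        return "negative semidefinite"
    return "indefinite"

for entries in [(2, -1, 2), (1, 1, 1), (1, 2, 1), (-2, 0, -3)]:
    print(entries, eigenvalues_sym_2x2(*entries), classify(*entries))
```

For larger matrices the eigenvalues are not available in closed form, and a numerical eigenvalue routine or an attempted Cholesky factorization would be used instead.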

Through diagonalization we can also easily show that a positive semidefinite matrix $A$ has a positive semidefinite (symmetric) square root $A^{1/2}$ satisfying $A^{1/2} \cdot A^{1/2} = A$. For this we use $Q$ as above and define

\[
A^{1/2} = Q \begin{bmatrix} \lambda_1^{1/2} & & \\ & \ddots & \\ & & \lambda_n^{1/2} \end{bmatrix} Q^T,
\]

which is easily verified to have the desired properties.
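
The construction can be carried out explicitly for a symmetric $2 \times 2$ matrix, where the orthogonal matrix $Q$ is a rotation. The following sketch in plain Python is limited to that case and rests on stated assumptions: the rotation angle is taken as $\theta = \frac{1}{2} \operatorname{atan2}(2b, a - d)$, which makes $Q^T A Q$ diagonal, and eigenvalues that come out slightly negative through rounding are clamped to zero. The function names are illustrative.

```python
import math

def sqrt_psd_2x2(A):
    """Square root Q diag(sqrt(l1), sqrt(l2)) Q^T of a symmetric positive
    semidefinite 2x2 matrix A, with Q = [[c, -s], [s, c]] a rotation."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    lam1 = a * c * c + 2.0 * b * s * c + d * s * s    # u1^T A u1, u1 = (c, s)
    lam2 = a * s * s - 2.0 * b * s * c + d * c * c    # u2^T A u2, u2 = (-s, c)
    r1, r2 = math.sqrt(max(lam1, 0.0)), math.sqrt(max(lam2, 0.0))
    off = (r1 - r2) * s * c
    return [[r1 * c * c + r2 * s * s, off],
            [off, r1 * s * s + r2 * c * c]]

def matmul2(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[2.0, 1.0], [1.0, 2.0]]      # eigenvalues 3 and 1
R = sqrt_psd_2x2(A)
print(R)
print(matmul2(R, R))              # reproduces A up to rounding
```

The result is symmetric by construction, and its eigenvalues are the nonnegative square roots of those of $A$, so it is itself positive semidefinite.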


A.5 TOPOLOGICAL CONCEPTS 


A sequence of vectors $x_0, x_1, \ldots, x_k, \ldots$, denoted $\{x_k\}_{k=0}^{\infty}$, or if the index set is understood, by simply $\{x_k\}$, is said to converge to the limit $x$ if $|x_k - x| \to 0$ as $k \to \infty$ (that is, if given $\varepsilon > 0$, there is an $N$ such that $k \ge N$ implies $|x_k - x| < \varepsilon$). If $\{x_k\}$ converges to $x$, we write $x_k \to x$ or $\lim x_k = x$.

A point $x$ is a limit point of the sequence $\{x_k\}$ if there is a subsequence of $\{x_k\}$ convergent to $x$. Thus $x$ is a limit point of $\{x_k\}$ if there is a subset $K$ of the positive integers such that $\{x_k\}_{k \in K}$ is convergent to $x$.

A sphere around $x$ is a set of the form $\{y : |y - x| < \varepsilon\}$ for some $\varepsilon > 0$. Such a sphere is also referred to as the neighborhood of $x$ of radius $\varepsilon$.

A subset $S$ of $E^n$ is open if around every point in $S$ there is a sphere that is contained in $S$. Equivalently, $S$ is open if given $x \in S$ there is an $\varepsilon > 0$ such that $|y - x| < \varepsilon$ implies $y \in S$. Thus the sphere $\{x : |x| < 1\}$ is open. In general, open sets can be characterized as sets having no sharp boundaries. The interior of any set $S$ in $E^n$ is the set of points $x \in S$ which are the center of some sphere contained in $S$. It is denoted $\mathring{S}$. The interior of a set is always open; indeed it is the largest open set contained in $S$. The interior of the set $\{x : |x| \le 1\}$ is the sphere $\{x : |x| < 1\}$.

A set $P$ is closed if every point that is arbitrarily close to the set $P$ is a member
of $P$. Equivalently, $P$ is closed if $x_k \to x$ with $x_k \in P$ implies $x \in P$. Thus the set
$\{x : |x| \le 1\}$ is closed. The closure of any set $P$ in $E^n$ is the smallest closed set
containing $P$. It is denoted $\bar{P}$. The boundary of a set is that part of the closure that
is not in the interior.

A set is compact if it is both closed and bounded (that is, if it is closed
and is contained within some sphere of finite radius). An important result, due to
Weierstrass, is that if $S$ is a compact set and $\{x_k\}$ is a sequence each member of
which belongs to $S$, then $\{x_k\}$ has a limit point in $S$ (that is, there is a subsequence
converging to a point in $S$).

Corresponding to a bounded sequence $\{r_k\}_{k=0}^{\infty}$ of real numbers, if we let $s_k =
\sup\{r_i : i \ge k\}$ then $\{s_k\}$ converges to some real number $s_0$. This number is called
the limit superior of $\{r_k\}$ and is denoted $\overline{\lim}\,(r_k)$.
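The tail suprema $s_k$ can be illustrated on a finite prefix of a sequence. This is only a sketch: a true limit superior involves the whole infinite tail, whereas the hypothetical helper `tail_suprema` below takes the maximum over the remaining terms of a finite list, so values near the end of the list are truncation artifacts. The example sequence $r_k = (-1)^k (1 + 1/(k+1))$ is an assumption chosen for illustration; its limit superior is 1.

```python
def tail_suprema(r):
    """Return s with s[k] = max(r[k:]) for a finite list r (a truncated sup)."""
    s = [0.0] * len(r)
    running = float("-inf")
    # Sweep from the end so each tail maximum reuses the previous one.
    for k in range(len(r) - 1, -1, -1):
        running = max(running, r[k])
        s[k] = running
    return s

# r_k oscillates in sign; its even-indexed terms decrease toward 1.
r = [(-1) ** k * (1.0 + 1.0 / (k + 1)) for k in range(1000)]
s = tail_suprema(r)
```

The list `s` is nonincreasing, as the sequence $\{s_k\}$ must be, and its early entries approach 1 from above.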


A.6 FUNCTIONS 


A real-valued function $f$ defined on a subset of $E^n$ is said to be continuous at $x$
if $x_k \to x$ implies $f(x_k) \to f(x)$. Equivalently, $f$ is continuous at $x$ if given $\varepsilon > 0$
there is a $\delta > 0$ such that $|y - x| < \delta$ implies $|f(y) - f(x)| < \varepsilon$. An important result
connected with continuous functions is a theorem of Weierstrass: A continuous
function $f$ defined on a compact set $S$ has a minimum point in $S$; that is, there is
an $x^* \in S$ such that for all $x \in S$, $f(x) \ge f(x^*)$.

A set of real-valued functions $f_1, f_2, \ldots, f_m$ on $E^n$ can be regarded as a
single vector function $f = (f_1, f_2, \ldots, f_m)$. This function assigns a vector $f(x) =
(f_1(x), f_2(x), \ldots, f_m(x))$ in $E^m$ to every vector $x \in E^n$. Such a vector-valued
function is said to be continuous if each of its component functions is continuous.

If each component of $f = (f_1, f_2, \ldots, f_m)$ is continuous on some open set of
$E^n$, then we write $f \in C$. If in addition, each component function has first partial
derivatives which are continuous on this set, we write $f \in C^1$. In general, if the
component functions have continuous partial derivatives of order $p$, we write $f \in C^p$.

If $f \in C^1$ is a real-valued function on $E^n$, $f(x) = f(x_1, x_2, \ldots, x_n)$, we define
the gradient of $f$ to be the vector

$$\nabla f(x) = \left[ \frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \ldots, \frac{\partial f(x)}{\partial x_n} \right].$$

We sometimes use the alternative notation $f_x(x)$ for $\nabla f(x)$. In matrix calculations
the gradient is considered to be a row vector.

If $f \in C^2$ then we define the Hessian of $f$ at $x$ to be the $n \times n$ matrix denoted
$\nabla^2 f(x)$ or $F(x)$ as

$$F(x) = \left[ \frac{\partial^2 f(x)}{\partial x_i \partial x_j} \right].$$

Since

$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i},$$

it is easily seen that the Hessian is symmetric.

For a vector-valued function $f = (f_1, f_2, \ldots, f_m)$ the situation is similar. If
$f \in C^1$, the first derivative is defined as the $m \times n$ matrix

$$\nabla f(x) = \left[ \frac{\partial f_i(x)}{\partial x_j} \right].$$

If $f \in C^2$ it is possible to define the $m$ Hessians $F_1(x), F_2(x), \ldots, F_m(x)$ corresponding
to the $m$ component functions. The second derivative itself, for a vector
function, is a third-order tensor but we do not require its use explicitly. Given any
$\lambda^T = [\lambda_1, \lambda_2, \ldots, \lambda_m] \in E^m$, we note, however, that the real-valued function $\lambda^T f$
has gradient equal to $\lambda^T \nabla f(x)$ and Hessian, denoted $\lambda^T F(x)$, equal to

$$\lambda^T F(x) = \sum_{i=1}^{m} \lambda_i F_i(x).$$

Also see Section 7.4 for a discussion of convex functions.
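The gradient and Hessian of a real-valued function can be approximated by central differences, which gives a quick numerical check of hand-computed derivatives and of the symmetry of the Hessian. The sketch below is an illustration under assumptions: the helper names `gradient` and `hessian`, the step sizes `h`, and the test function $f(x) = x_1^2 x_2 + 3 x_2^2$ are not from the text.

```python
def gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient of f at x (as a list)."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g

def hessian(f, x, h=1e-4):
    """Central-difference approximation of the n x n Hessian of f at x."""
    n = len(x)

    def shifted(i, j, di, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return f(y)

    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            H[i][j] = (shifted(i, j, h, h) - shifted(i, j, h, -h)
                       - shifted(i, j, -h, h) + shifted(i, j, -h, -h)) / (4.0 * h * h)
    return H

def f(x):
    # Illustrative test function: f(x) = x1^2 * x2 + 3 * x2^2.
    return x[0] ** 2 * x[1] + 3.0 * x[1] ** 2

x0 = [1.0, 2.0]
g = gradient(f, x0)   # analytic gradient at x0 is [4, 13]
F = hessian(f, x0)    # analytic Hessian at x0 is [[4, 2], [2, 6]]
```

For this test function the analytic values are $\nabla f(x) = [2 x_1 x_2,\ x_1^2 + 6 x_2]$ and $F(x) = [[2 x_2, 2 x_1], [2 x_1, 6]]$, so the approximations can be compared entry by entry.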


Taylor’s Theorem 


A group of results that are used frequently in analysis are referred to under the
general heading of Taylor's Theorem or Mean Value Theorems. If $f \in C^1$ in a
region containing the line segment $[x_1, x_2]$, then there is a $\theta$, $0 \le \theta \le 1$ such that

$$f(x_2) = f(x_1) + \nabla f(\theta x_1 + (1 - \theta) x_2)(x_2 - x_1).$$

Furthermore, if $f \in C^2$ then there is a $\theta$, $0 \le \theta \le 1$ such that

$$f(x_2) = f(x_1) + \nabla f(x_1)(x_2 - x_1) + \tfrac{1}{2}(x_2 - x_1)^T F(\theta x_1 + (1 - \theta) x_2)(x_2 - x_1),$$

where $F$ denotes the Hessian of $f$.


Implicit Function Theorem 
Suppose we have a set of $m$ equations in $n$ variables

$$h_i(x) = 0, \quad i = 1, 2, \ldots, m.$$

The implicit function theorem addresses the question as to whether if $n - m$ of
the variables are fixed, the equations can be solved for the remaining $m$ variables.
Thus selecting $m$ variables, say $x_1, x_2, \ldots, x_m$, we wish to determine if these may
be expressed in terms of the remaining variables in the form

$$x_i = \phi_i(x_{m+1}, x_{m+2}, \ldots, x_n), \quad i = 1, 2, \ldots, m.$$

The functions $\phi_i$, if they exist, are called implicit functions.


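Theorem. Let $x^0 = (x_1^0, x_2^0, \ldots, x_n^0)$ be a point in $E^n$ satisfying the properties:
i) The functions $h_i \in C^p$, $i = 1, 2, \ldots, m$ in some neighborhood of $x^0$, for
some $p \ge 1$.
ii) $h_i(x^0) = 0$, $i = 1, 2, \ldots, m$.
iii) The $m \times m$ Jacobian matrix

$$J = \begin{bmatrix} \dfrac{\partial h_1(x^0)}{\partial x_1} & \cdots & \dfrac{\partial h_1(x^0)}{\partial x_m} \\ \vdots & & \vdots \\ \dfrac{\partial h_m(x^0)}{\partial x_1} & \cdots & \dfrac{\partial h_m(x^0)}{\partial x_m} \end{bmatrix}$$

is nonsingular.

Then there is a neighborhood of $\hat{x}^0 = (x_{m+1}^0, x_{m+2}^0, \ldots, x_n^0) \in E^{n-m}$ such
that for $\hat{x} = (x_{m+1}, x_{m+2}, \ldots, x_n)$ in this neighborhood there are functions
$\phi_i(\hat{x})$, $i = 1, 2, \ldots, m$ such that
i) $\phi_i \in C^p$.
ii) $x_i^0 = \phi_i(\hat{x}^0)$, $i = 1, 2, \ldots, m$.
iii) $h_i(\phi_1(\hat{x}), \phi_2(\hat{x}), \ldots, \phi_m(\hat{x}), \hat{x}) = 0$, $i = 1, 2, \ldots, m$.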


Example 1. Consider the equation $x_1^2 + x_2 = 0$. A solution is $x_1 = 0$, $x_2 = 0$.
However, in a neighborhood of this solution there is no function $\phi$ such that $x_1 =
\phi(x_2)$. At this solution condition (iii) of the implicit function theorem is violated.
At any other solution, however, such a $\phi$ exists.


Example 2. Let $A$ be an $m \times n$ matrix ($m < n$) and consider the system of linear
equations $Ax = b$. If $A$ is partitioned as $A = [B, C]$ where $B$ is $m \times m$ then condition
(iii) is satisfied if and only if $B$ is nonsingular. This condition corresponds, of course,
exactly with what the theory of linear equations tells us. In view of this example, the
implicit function theorem can be regarded as a nonlinear generalization of the linear theory.
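In the linear case of Example 2 the implicit functions can be written down explicitly: with $x$ split into its first $m$ components $x_B$ and the remaining components $\hat{x}$, the equation $B x_B + C \hat{x} = b$ gives $x_B = B^{-1}(b - C\hat{x})$ whenever $B$ is nonsingular. The sketch below works this out for $m = 2$ using Cramer's rule; the function names and the numerical data are illustrative assumptions.

```python
def solve_2x2(B, r):
    """Solve B z = r for a 2 x 2 matrix B by Cramer's rule."""
    det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    if det == 0:
        # Condition (iii) fails: B is singular, so no implicit function is defined.
        raise ValueError("B is singular")
    return [(r[0] * B[1][1] - B[0][1] * r[1]) / det,
            (B[0][0] * r[1] - B[1][0] * r[0]) / det]

def implicit_function(B, C, b, x_hat):
    """Return x_B = B^{-1} (b - C x_hat) for A = [B, C] with B of size 2 x 2."""
    r = [b[i] - sum(C[i][j] * x_hat[j] for j in range(len(x_hat)))
         for i in range(2)]
    return solve_2x2(B, r)

# Illustrative data: two equations in three variables.
B = [[2.0, 1.0], [1.0, 3.0]]
C = [[1.0], [2.0]]
b = [5.0, 10.0]
x_B = implicit_function(B, C, b, [1.0])   # x_1, x_2 as functions of x_3 = 1
```

Substituting `x_B` together with the chosen $\hat{x}$ back into $Ax = b$ should satisfy both equations up to rounding.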


o, O Notation 


If $g$ is a real-valued function of a real variable, the notation $g(x) = O(x)$ means
that $g(x)$ goes to zero at least as fast as $x$ does. More precisely, it means that there
is a $K \ge 0$ such that

$$\left| \frac{g(x)}{x} \right| \le K \quad \text{as } x \to 0.$$

The notation $g(x) = o(x)$ means that $g(x)$ goes to zero faster than $x$ does; or
equivalently, that $K$ above is zero.


Appendix B CONVEX SETS 


B.1 BASIC DEFINITIONS 


Concepts related to convex sets so dominate the theory of optimization that it is 
essential for a student of optimization to have knowledge of their most fundamental 
properties. In this appendix is compiled a brief summary of the most important of 
these properties. 


Definition. A set $C$ in $E^n$ is said to be convex if for every $x_1, x_2 \in C$ and
every real number $\alpha$, $0 < \alpha < 1$, the point $\alpha x_1 + (1 - \alpha) x_2 \in C$.


This definition can be interpreted geometrically as stating that a set is convex 
if, given two points in the set, every point on the line segment joining these two 
points is also a member of the set. This is illustrated in Fig. B.1. 

The following proposition shows that certain familiar set operations preserve
convexity.


Proposition 1. Convex sets in $E^n$ satisfy the following relations:

i) If $C$ is a convex set and $\beta$ is a real number, the set

$$\beta C = \{x : x = \beta c,\ c \in C\}$$

is convex.
ii) If $C$ and $D$ are convex sets, then the set

$$C + D = \{x : x = c + d,\ c \in C,\ d \in D\}$$

is convex.
iii) The intersection of any collection of convex sets is convex.


The proofs of these three properties follow directly from the definition of a 
convex set and are left to the reader. The properties themselves are illustrated in 
Fig. B.2. 

Another important concept is that of forming the smallest convex set containing
a given set.

[Fig. B.1 Convexity (panels: convex, nonconvex)]

[Fig. B.2 Properties of convex sets]


Definition. Let $S$ be a subset of $E^n$. The convex hull of $S$, denoted co($S$), is
the set which is the intersection of all convex sets containing $S$. The closed
convex hull of $S$ is defined as the closure of co($S$).


Finally, we conclude this section by defining a cone and a convex cone. A 
convex cone is a special kind of convex set that arises quite frequently. 


[Fig. B.3 Cones (panels: not convex, not convex, convex)]

Definition. A set $C$ is a cone if $x \in C$ implies $\alpha x \in C$ for all $\alpha > 0$. A cone
that is also convex is a convex cone.


Some cones are shown in Fig. B.3. Their basic property is that if a point x 
belongs to a cone, then the entire half line from the origin through the point (but 
not the origin itself) also must belong to the cone. 


B.2 HYPERPLANES AND POLYTOPES 


The most important type of convex set (aside from single points) is the hyperplane. 
Hyperplanes dominate the entire theory of optimization, appearing under the guise 
of Lagrange multipliers, duality theory, or gradient calculations. 

The most natural definition of a hyperplane is the logical generalization of 
the geometric properties of a plane in three dimensions. We start by giving this 
geometric definition. For computations and for a concrete description of hyperplanes,
however, there is an equivalent algebraic definition that is more useful. A
major portion of this section is devoted to establishing this equivalence.


Definition. A set $V$ in $E^n$ is said to be a linear variety, if, given any $x_1, x_2 \in V$,
we have $\lambda x_1 + (1 - \lambda) x_2 \in V$ for all real numbers $\lambda$.


Note that the only difference between the definition of a linear variety and a
convex set is that in a linear variety the entire line passing through any two points,
rather than simply the line segment between them, must lie in the set. Thus in three
dimensions the nonempty linear varieties are points, lines, two-dimensional planes,
and the whole space. In general, it is clear that we may speak of the dimension of a
linear variety. Thus, for example, a point is a linear variety of dimension zero and
a line is a linear variety of dimension one. In the general case, the dimension of
a linear variety in $E^n$ can be found by translating it (moving it) so that it contains
the origin and then determining the dimension of the resulting set, which is then a
subspace of $E^n$.


Definition. A hyperplane in $E^n$ is an $(n - 1)$-dimensional linear variety.


We see that hyperplanes generalize the concept of a two-dimensional plane in 
three-dimensional space. They can be regarded as the largest linear varieties in a 
space, other than the entire space itself. 

We now relate this abstract geometric definition to an algebraic one. 


Proposition 2. Let $a$ be a nonzero $n$-dimensional column vector, and let $c$ be
a real number. The set

$$H = \{x \in E^n : a^T x = c\}$$

is a hyperplane in $E^n$.


Proof. It follows directly from the linearity of the equation $a^T x = c$ that $H$ is
a linear variety. Let $x_1$ be any vector in $H$. Translating by $-x_1$ we obtain the set
$M = H - x_1$ which is a linear subspace of $E^n$. This subspace consists of all vectors
$x$ satisfying $a^T x = 0$; in other words, all vectors orthogonal to $a$. This is clearly an
$(n - 1)$-dimensional subspace. ∎


Proposition 3. Let $H$ be a hyperplane in $E^n$. Then there is a nonzero
$n$-dimensional vector $a$ and a constant $c$ such that

$$H = \{x \in E^n : a^T x = c\}.$$


Proof. Let $x_1 \in H$ and translate by $-x_1$ obtaining the set $M = H - x_1$. Since
$H$ is a hyperplane, $M$ is an $(n - 1)$-dimensional subspace. Let $a$ be any nonzero
vector that is orthogonal to this subspace, that is, $a$ belongs to the one-dimensional
subspace $M^{\perp}$. Clearly $M = \{x : a^T x = 0\}$. Letting $c = a^T x_1$ we see that if $x_2 \in H$
we have $x_2 - x_1 \in M$ and thus $a^T x_2 - a^T x_1 = 0$ which implies $a^T x_2 = c$. Thus
$H \subset \{x : a^T x = c\}$. Since $H$ is, by definition, of dimension $n - 1$ and $\{x : a^T x = c\}$
is of dimension $n - 1$ by Proposition 2, these two sets must be equal. ∎


Combining Propositions 2 and 3, we see that a hyperplane is the set of solutions 
to a single linear equation. This is illustrated in Fig. B.4. We now use hyperplanes 
to build up other important classes of convex sets. 


Definition. Let $a$ be a nonzero vector in $E^n$ and let $c$ be a real number.
Corresponding to the hyperplane $H = \{x : a^T x = c\}$ are the positive and negative
closed half spaces

$$H_+ = \{x : a^T x \ge c\}$$
$$H_- = \{x : a^T x \le c\}$$

and the positive and negative open half spaces

$$\mathring{H}_+ = \{x : a^T x > c\}$$
$$\mathring{H}_- = \{x : a^T x < c\}.$$


[Fig. B.4]

[Fig. B.5 Polytopes]


It is easy to see that half spaces are convex sets and that the union of $H_+$ and
$H_-$ is the whole space.


Definition. A set which can be expressed as the intersection of a finite number
of closed half spaces is said to be a convex polytope.


We see that convex polytopes are the sets obtained as the family of solutions
to a set of linear inequalities of the form

$$a_1^T x \le b_1$$
$$a_2^T x \le b_2$$
$$\vdots$$
$$a_m^T x \le b_m$$

since each individual inequality defines a half space and the solution family is
the intersection of these half spaces. (If some $a_i = 0$, the resulting set can still, as
the reader may verify, be expressed as the intersection of a finite number of half
spaces.)
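Membership in a convex polytope is simply a matter of checking each linear inequality in turn. The sketch below assumes the polytope is stored as a list of rows `A` (the vectors $a_i^T$) and a list `b` of right-hand sides; the function name `in_polytope`, the tolerance `tol`, and the unit-square example are illustrative assumptions.

```python
def in_polytope(A, b, x, tol=1e-12):
    """Return True if a_i^T x <= b_i (within tol) for every row a_i of A."""
    return all(
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i + tol
        for row, b_i in zip(A, b)
    )

# The unit square 0 <= x1 <= 1, 0 <= x2 <= 1 as an intersection of four half spaces.
A_square = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_square = [1.0, 0.0, 1.0, 0.0]

inside = in_polytope(A_square, b_square, [0.5, 0.25])
outside = in_polytope(A_square, b_square, [1.5, 0.25])
```

Because each inequality defines a closed half space, points on the boundary (such as a corner of the square) are counted as members.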

Several polytopes are illustrated in Fig. B.5. We note that a polytope may be 
empty, bounded, or unbounded. The case of a nonempty bounded polytope is of 
special interest and we distinguish this case by the following. 


Definition. A nonempty bounded polytope is called a polyhedron. 


B.3 SEPARATING AND SUPPORTING HYPERPLANES


The two theorems in this section are perhaps the most important results related to
convexity. Geometrically, the first states that given a point outside a convex set, a
hyperplane can be passed through the point that does not touch the convex set. The
second, which is a limiting case of the first, states that given a boundary point of a
convex set, there is a hyperplane that contains the boundary point and contains the
convex set on one side of it.


Theorem 1. Let $C$ be a convex set and let $y$ be a point exterior to the closure
of $C$. Then there is a vector $a$ such that $a^T y < \inf_{x \in C} a^T x$.


Proof. Let

$$\delta = \inf_{x \in C} |x - y| > 0.$$

There is an $x_0$ on the boundary of $C$ such that $|x_0 - y| = \delta$. This follows because
the continuous function $f(x) = |x - y|$ achieves its minimum over any closed and
bounded set and it is clearly only necessary to consider $x$ in the intersection of the
closure of $C$ and the sphere of radius $2\delta$ centered at $y$.

We shall show that setting $a = x_0 - y$ satisfies the conditions of the theorem.
Let $x \in C$. For any $\alpha$, $0 \le \alpha \le 1$, the point $x_0 + \alpha(x - x_0) \in \bar{C}$ and thus

$$|x_0 + \alpha(x - x_0) - y|^2 \ge |x_0 - y|^2.$$

Expanding,

$$2\alpha (x_0 - y)^T (x - x_0) + \alpha^2 |x - x_0|^2 \ge 0.$$

Thus, considering this as $\alpha \to 0+$, we obtain

$$(x_0 - y)^T (x - x_0) \ge 0$$

or,

$$(x_0 - y)^T x \ge (x_0 - y)^T x_0 = (x_0 - y)^T y + (x_0 - y)^T (x_0 - y) = (x_0 - y)^T y + \delta^2.$$

Setting $a = x_0 - y$ proves the theorem. ∎
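The construction in the proof can be carried out concretely when the closest point $x_0$ is easy to compute. The sketch below takes $C$ to be an axis-aligned box, an assumption made only because projection onto a box reduces to clipping each coordinate. It forms $a = x_0 - y$ and compares $a^T y$ with the minimum of $a^T x$ over the box, which for a linear function is attained at a vertex. All names and numbers are illustrative.

```python
from itertools import product

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def project_onto_box(y, lo, hi):
    """Closest point to y in the box {x : lo_i <= x_i <= hi_i}."""
    return [min(max(yi, l), h) for yi, l, h in zip(y, lo, hi)]

def separating_vector(y, lo, hi):
    """Return (a, x0) with x0 the closest point of the box to y and a = x0 - y."""
    x0 = project_onto_box(y, lo, hi)
    a = [x0i - yi for x0i, yi in zip(x0, y)]
    return a, x0

lo, hi = [0.0, 0.0], [1.0, 1.0]
y = [2.0, 3.0]                       # a point outside the box
a, x0 = separating_vector(y, lo, hi)
delta_sq = dot(a, a)                 # squared distance from y to the box

# A linear function attains its minimum over a box at one of the vertices.
min_over_box = min(dot(a, v) for v in product(*zip(lo, hi)))
gap = min_over_box - dot(a, y)
```

In this example the gap between $\inf_{x \in C} a^T x$ and $a^T y$ equals $\delta^2$, matching the final inequality of the proof with equality at $x = x_0$.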


The geometrical interpretation of Theorem 1 is that, given a convex set $C$ and a
point $y$ exterior to the closure of $C$, there is a hyperplane containing $y$ that contains
$C$ in one of its open half spaces. We can easily extend this theorem to include the
case where $y$ is a boundary point of $C$.


Theorem 2. Let $C$ be a convex set and let $y$ be a boundary point of $C$. Then
there is a hyperplane containing $y$ and containing $C$ in one of its closed half
spaces.


Proof. Let $\{y_k\}$ be a sequence of vectors, exterior to the closure of $C$, converging
to $y$. Let $\{a_k\}$ be the sequence of corresponding vectors constructed according to
Theorem 1, normalized so that $|a_k| = 1$, such that

$$a_k^T y_k < \inf_{x \in C} a_k^T x.$$

Since $\{a_k\}$ is a bounded sequence, it has a convergent subsequence $\{a_k\}$, $k \in K$
with limit $a$. For this vector we have for any $x \in C$,

$$a^T y = \lim_{k \in K} a_k^T y_k \le \lim_{k \in K} a_k^T x = a^T x.$$ ∎


Definition. A hyperplane containing a convex set C in one of its closed 
half spaces and containing a boundary point of C is said to be a supporting 
hyperplane of C. 


In terms of this definition, Theorem 2 says that, given a convex set C and a 
boundary point y of C, there is a hyperplane supporting C at y. 

It is useful in the study of convex sets to consider the relative interior of a 
convex set C defined as the largest subset of C that contains no boundary points 
of C. 

Another variation of the theorems of this section is the one that follows, which 
is commonly known as the Separating Hyperplane Theorem. 


Theorem 3. Let $B$ and $C$ be convex sets with no common relative interior
points. (That is, the only common points are boundary points.) Then there is
a hyperplane separating $B$ and $C$. In particular, there is a nonzero vector $a$
such that $\sup_{b \in B} a^T b \le \inf_{c \in C} a^T c$.


Proof. Consider the set $G = C - B$. It is easily shown that $G$ is convex and that
0 is not a relative interior point of $G$. Hence, Theorem 1 or Theorem 2 applies and
gives the appropriate hyperplane. ∎


B.4 EXTREME POINTS 


Definition. A point $x$ in a convex set $C$ is said to be an extreme point of $C$
if there are no two distinct points $x_1$ and $x_2$ in $C$ such that $x = \alpha x_1 + (1 - \alpha) x_2$
for some $\alpha$, $0 < \alpha < 1$.


For example, in $E^2$ the extreme points of a square are its four corners; the
extreme points of a circular disk are all points on the boundary. Note that a linear
variety consisting of more than one point has no extreme points.


Lemma 1. Let $C$ be a convex set, $H$ a supporting hyperplane of $C$, and $T$ the
intersection of $H$ and $C$. Every extreme point of $T$ is an extreme point of $C$.


Proof. Suppose $x_0 \in T$ is not an extreme point of $C$. Then $x_0 = \alpha x_1 + (1 - \alpha) x_2$
for some $x_1, x_2 \in C$, $x_1 \ne x_2$, $0 < \alpha < 1$. Let $H$ be described as $H = \{x : a^T x = c\}$
with $C$ contained in its closed positive half space. Then

$$a^T x_1 \ge c, \quad a^T x_2 \ge c.$$

But, since $x_0 \in H$,

$$c = a^T x_0 = \alpha a^T x_1 + (1 - \alpha) a^T x_2,$$

and thus $x_1$ and $x_2 \in H$. Hence $x_1, x_2 \in T$ and $x_0$ is not an extreme point of $T$. ∎


Theorem 4. A closed bounded convex set in $E^n$ is equal to the closed convex
hull of its extreme points.


Proof. The proof is by induction on the dimension of the space $E^n$. The statement
is easily seen to be true for $n = 1$. Suppose that it is true for $n - 1$. Let $C$ be a closed
bounded convex set in $E^n$, and let $K$ be the closed convex hull of the extreme
points of $C$. We wish to show that $K = C$.

Assume there is $y \in C$, $y \notin K$. Then by Theorem 1, Section B.3, there is a
hyperplane separating $y$ and $K$; that is, there is $a \ne 0$, such that $a^T y < \inf_{x \in K} a^T x$.
Let $c_0 = \inf_{x \in C}(a^T x)$. The number $c_0$ is finite and there is an $x_0 \in C$ for which
$a^T x_0 = c_0$, because by Weierstrass' Theorem, the continuous function $a^T x$ achieves
its minimum over any closed bounded set. Thus the hyperplane $H = \{x : a^T x = c_0\}$
is a supporting hyperplane to $C$. It is disjoint from $K$ since $c_0 < \inf_{x \in K}(a^T x)$.

Let $T = H \cap C$. Then $T$ is a bounded closed convex subset of $H$ which can be
regarded as a space of dimension $n - 1$. $T$ is nonempty, since it contains $x_0$. Thus,
by the induction hypothesis, $T$ contains extreme points; and by Lemma 1 these are
also extreme points of $C$. Thus we have found extreme points of $C$ not in $K$, which
is a contradiction. ∎


Let us investigate the implications of this theorem for convex polyhedra. We 
recall that a convex polyhedron is a bounded polytope. Being the intersection of 
closed half spaces, a convex polyhedron is also closed. Thus any convex polyhedron 
is the closed convex hull of its extreme points. It can be shown (see Section 2.5) 
that any polytope has at most a finite number of extreme points and hence a convex 
polyhedron is equal to the convex hull of a finite number of points. The converse 
can also be established, yielding the following two equivalent characterizations. 


Theorem 5. A convex polyhedron can be described either as a bounded 
intersection of a finite number of closed half spaces, or as the convex hull of 
a finite number of points. 
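For a finite set of points in the plane, the extreme points of the convex hull can be computed directly. The sketch below uses Andrew's monotone chain method, a standard technique that is not described in this appendix and is swapped in purely for illustration. The function names and the sample points are assumptions.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counterclockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Extreme points of the convex hull of 2-D points, in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that are not strictly left turns (interior or on an edge).
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

# Four corners of a square, one interior point, and one point on an edge.
sample = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
hull = convex_hull(sample)
```

Only the four corners survive: the interior point and the point on an edge are convex combinations of other points, so they are not extreme points.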


Appendix C GAUSSIAN ELIMINATION


This appendix describes the method for solving systems of linear equations that 
has proved to be, not only the most popular, but also the fastest and least 
susceptible to round-off error accumulation—the method of Gaussian elimination. 
Attention is directed toward explaining this classical elimination technique itself 
and its relation to the theory of LU decomposition of a non-singular square 
matrix. 

We first note how easily triangular systems of equations can be solved. Thus
the system

$$a_{11} x_1 = b_1$$
$$a_{21} x_1 + a_{22} x_2 = b_2$$
$$\vdots$$
$$a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n = b_n$$

can be solved recursively as follows:

$$x_1 = b_1 / a_{11}$$
$$x_2 = (b_2 - a_{21} x_1) / a_{22}$$
$$\vdots$$
$$x_n = (b_n - a_{n1} x_1 - a_{n2} x_2 - \cdots - a_{n,n-1} x_{n-1}) / a_{nn}$$

provided that each of the diagonal terms $a_{ii}$, $i = 1, 2, \ldots, n$, is nonzero (as they must
be if the system is nonsingular). This observation motivates us to attempt to reduce
an arbitrary system of equations to a triangular one.


Definition. A square matrix $C = [c_{ij}]$ is said to be lower triangular if $c_{ij} = 0$
for $i < j$. Similarly, $C$ is said to be upper triangular if $c_{ij} = 0$ for $i > j$.
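The recursive solution of a lower triangular system translates directly into a short loop. The following sketch assumes the matrix is stored as a list of rows; the function name `forward_substitution` and the sample data are illustrative.

```python
def forward_substitution(L, b):
    """Solve L x = b for a lower triangular matrix L with nonzero diagonal."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        if L[i][i] == 0:
            raise ZeroDivisionError("zero diagonal term: the system is singular")
        # Subtract the contributions of the already computed unknowns x_1..x_{i-1}.
        known = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - known) / L[i][i]
    return x

L_example = [[2.0, 0.0, 0.0],
             [1.0, 3.0, 0.0],
             [4.0, -1.0, 5.0]]
b_example = [2.0, 5.0, 8.0]
x_example = forward_substitution(L_example, b_example)
```

Each unknown costs one pass over the previously computed entries, so the whole solve takes on the order of $n^2$ arithmetic operations.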


In matrix notation, the idea of Gaussian elimination is to somehow find a 
decomposition of a given n x n matrix A in the form A = LU where L is a lower 
triangular and U an upper triangular matrix. The system 


Ax=b (C.1) 
can then be solved by solving the two triangular systems 
Ly=b, Ux=y. (C.2) 


The calculation of L and U together with solution of the first of these systems is 
usually referred to as forward elimination, while solution of the second triangular 
system is called back substitution. 

Every nonsingular square matrix A has an LU decomposition, provided that 
interchanges of rows of A are introduced if necessary. This interchange of rows 
corresponds to a simple reordering of the system of equations, and hence amounts 
to no loss of generality in the method. For simplicity of notation, however, we 
assume that no such interchanges are required. 

We turn now to the problem of explicitly determining $L$ and $U$, by elimination,
for a nonsingular matrix $A$. Given the system, we attempt to transform it so that
zeros appear below the main diagonal. Assuming that $a_{11} \ne 0$ we subtract multiples
of the first equation from each of the others in order to get zeros in the first column
below $a_{11}$. If we define $m_{k1} = a_{k1}/a_{11}$ and let

$$M_1 = \begin{bmatrix} 1 & & & \\ -m_{21} & 1 & & \\ \vdots & & \ddots & \\ -m_{n1} & & & 1 \end{bmatrix},$$

the resulting new system of equations can be expressed as

$$A^{(2)} x = b^{(2)}$$

with

$$A^{(2)} = M_1 A, \quad b^{(2)} = M_1 b.$$

The matrix $A^{(2)} = [a_{ij}^{(2)}]$ has $a_{k1}^{(2)} = 0$, $k > 1$.

Next, assuming $a_{22}^{(2)} \ne 0$, multiples of the second equation of the new system
are subtracted from equations 3 through $n$ to yield zeros below $a_{22}^{(2)}$ in the second
column. This is equivalent to premultiplying $A^{(2)}$ and $b^{(2)}$ by

$$M_2 = \begin{bmatrix} 1 & 0 & & & \\ 0 & 1 & & & \\ & -m_{32} & 1 & & \\ & \vdots & & \ddots & \\ & -m_{n2} & & & 1 \end{bmatrix},$$

where $m_{k2} = a_{k2}^{(2)} / a_{22}^{(2)}$. This yields $A^{(3)} = M_2 A^{(2)}$ and $b^{(3)} = M_2 b^{(2)}$.

Proceeding in this way we obtain $A^{(n)} = M_{n-1} M_{n-2} \cdots M_1 A$, an upper triangular
matrix which we denote by $U$. The matrix $M = M_{n-1} M_{n-2} \cdots M_1$ is a lower
triangular matrix, and since $MA = U$ we have $A = M^{-1} U$. The matrix $L = M^{-1}$ is
also lower triangular and becomes the $L$ of the desired $LU$ decomposition for $A$.

The representation for $L$ can be made more explicit by noting that $M_k^{-1}$ is the
same as $M_k$ except that the off-diagonal terms have the opposite sign. Furthermore,
we have $L = M^{-1} = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1}$, which is easily verified to be

$$L = \begin{bmatrix} 1 & & & \\ m_{21} & 1 & & \\ m_{31} & m_{32} & 1 & \\ \vdots & \vdots & & \ddots \\ m_{n1} & m_{n2} & \cdots & 1 \end{bmatrix}.$$

Hence $L$ can be evaluated directly in terms of the calculations required by the elimination
process. Of course, an explicit representation for $M = L^{-1}$ would actually
be more useful but a simple representation for $M$ does not exist. Thus we content
ourselves with the explicit representation for $L$ and use it in (C.2).

If the original system (C.1) is to be solved for a single $b$ vector, the vector
$y$ satisfying $Ly = b$ is usually calculated simultaneously with $L$ in the form
$y = b^{(n)} = Mb$. The final solution $x$ is then found by a single back substitution,
from $Ux = y$. Once the $LU$ decomposition of $A$ has been obtained, however, the
solution corresponding to any right-hand side can be found by solving the two
systems (C.2).
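The elimination procedure and the two triangular solves of (C.2) can be sketched as follows. This version assumes, as the text does for simplicity, that no row interchanges are needed, and it raises an error if a zero pivot is met. The function names `lu_decompose` and `solve_lu` and the sample matrix are illustrative assumptions.

```python
def lu_decompose(A):
    """Return (L, U) with A = L U, L unit lower triangular, U upper triangular.

    No row interchanges are performed, so every pivot must be nonzero.
    """
    n = len(A)
    U = [[float(v) for v in row] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        if U[k][k] == 0.0:
            raise ValueError("zero pivot: a row interchange would be required")
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]      # the multiplier m_ik of the text
            L[i][k] = m                # L stores the multipliers directly
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U

def solve_lu(L, U, b):
    """Solve L y = b by forward elimination, then U x = y by back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))   # L has unit diagonal
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

A_example = [[2.0, 1.0, 1.0],
             [4.0, 3.0, 3.0],
             [8.0, 7.0, 9.0]]
L_factor, U_factor = lu_decompose(A_example)
x_solution = solve_lu(L_factor, U_factor, [4.0, 10.0, 24.0])
```

Once `L_factor` and `U_factor` are available, `solve_lu` can be called again with any other right-hand side without repeating the elimination.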

In practice, the diagonal element $a_{kk}^{(k)}$ of $A^{(k)}$ may become zero or very close
to zero. In this case it is important that the $k$th row be interchanged with a row
that is below it. Indeed, for considerations of numerical accuracy, it is desirable to
continuously introduce row interchanges of this type in such a way as to ensure $|m_{ij}| \le
1$ for all $i, j$. If this is done, the Gaussian elimination procedure has exceptionally
good stability properties.
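One common way to keep every multiplier at most 1 in magnitude is partial pivoting: at step $k$, the row at or below the diagonal with the largest entry in column $k$ is moved into the pivot position. The sketch below is one possible arrangement under that assumption. It returns a row permutation `perm` together with $L$ and $U$ such that row `perm[i]` of $A$ equals row $i$ of $LU$; the function name and the sample matrix are illustrative.

```python
def lu_partial_pivoting(A):
    """Return (perm, L, U) with row perm[i] of A equal to row i of L U."""
    n = len(A)
    U = [[float(v) for v in row] for row in A]
    L = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n - 1):
        # Choose the row at or below k with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(U[i][k]))
        if U[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            U[k], U[p] = U[p], U[k]
            L[k], L[p] = L[p], L[k]    # carry the stored multipliers along
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]      # |m| <= 1 by the choice of pivot
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    if U[n - 1][n - 1] == 0.0:
        raise ValueError("matrix is singular")
    for i in range(n):
        L[i][i] = 1.0
    return perm, L, U

# The (1, 1) entry is zero, so elimination without interchanges would fail.
A_pivot = [[0.0, 2.0, 1.0],
           [1.0, 1.0, 0.0],
           [2.0, 1.0, 1.0]]
perm, L_p, U_p = lu_partial_pivoting(A_pivot)
```

The example matrix has a zero in its leading position, so it cannot be factored without an interchange, yet the pivoted factorization goes through with all multipliers bounded by 1.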


BIBLIOGRAPHY 


Abadie, J., and Carpentier, J., “Generalization of the Wolfe Reduced Gradient Method 
to the Case of Nonlinear Constraints,” in Optimization, R. Fletcher (editor), 
Academic Press, London, 1969, pages 37-47 


Akaike, H., “On a Successive Transformation of Probability Distribution and Its 
Application to the Analysis of the Optimum Gradient Method,” Ann. Inst. Statist. 
Math. 11, 1959, pages 1-17 


Alizadeh, F., Combinatorial Optimization with Interior Point Methods and Semi- 
definite Matrices, Ph.D. Thesis, University of Minnesota, Minneapolis, Minn., 1991. 


Alizadeh, F., “Optimization Over the Positive Semi-definite Cone: Interior—point 
Methods and Combinatorial Applications,” in Advances in Optimization and Parallel 
Computing, P. M. Pardalos (editor), North Holland, Amsterdam, The Netherlands, 
1992, pages 1-25 


Andersen, E. D., and Ye, Y., “On a Homogeneous Algorithm for the Monotone 
Complementarity Problem,” Math. Prog. 84, 1999, pages 375-400 


Anstreicher, K. M., den Hertog, D., Roos, C., and Terlaky, T., “A Long Step 
Barrier Method for Convex Quadratic Programming,” Algorithmica, 10, 1993, pages 
365-382 


Antosiewicz, H. A., and Rheinboldt, W. C., “Numerical Analysis and Functional 
Analysis,” Chapter 14 in Survey of Numerical Analysis, J. Todd (editor), McGraw- 
Hill, New York, 1962 


Armijo, L., “Minimization of Functions Having Lipschitz Continuous First-Partial 
Derivatives,” Pacific J. Math. 16, 1, 1966, pages 1-3 


Atrow, K. J., and Hurwicz, L., “Gradient Method for Concave Programming, I.: Local 
Results,” in Studies in Linear and Nonlinear Programming, K. J. Arrow, L. Hurwicz, 
and H. Uzawa (editors), Stanford University Press, Stanford, Calif., 1958 


Bartels, R. H., “A Numerical Investigation of the Simplex Method,” Technical Report 
No. CS 104, July 31, 1968, Computer Science Dept., Stanford University, Stanford, 
Calif. 

Bartels, R. H., and Golub, G. H., “The Simplex Method of Linear Programming Using 
LU Decomposition,” Comm. ACM 12, 5, 1969, pages 266-268 

Bayer, D. A., and Lagarias, J. C., “The Nonlinear Geometry of Linear Programming, 
Part I: Affine and Projective Scaling Trajectories,” Trans. Ame. Math. Soc., 314, 2, 
1989, pages 499-526 


527 


528 Bibliography 


[B4] Bayer, D. A., and Lagarias, J. C., “The Nonlinear Geometry of Linear Programming, 
Part Il; Legendre Transform Coordinates,” Trans. Ame. Math. Soc. 314, 2, 1989, 
pages 527-581 


[B5] Bazaraa, M. S., and Jarvis, J. J., Linear Programming and Network Flows, John Wiley, 
New York, 1977 


[B6] Bazaraa, M. S., Jarvis, J. J., and Sherali, H. F., Linear Programming and Network 
Flows, Chapter 8.4 : Karmarkar’s Projective Algorithm, pages 380-394, Chapter 8.5: 
Analysis of Karmarkar’s algorithm, pages 394-418. John Wiley & Sons, New York, 
second edition, 1990 


[B7] Beale, E. M. L., “Numerical Methods,” in Nonlinear Programming, J. Abadie (editor), 
North-Holland, Amsterdam, 1967 

[B8] Beckman, F. S., “The Solution of Linear Equations by the Conjugate Gradient 
Method,” in Mathematical Methods for Digital Computers 1, A. Ralston and 
H. S. Wilf (editors), John Wiley, New York, 1960 


[B9] Bertsekas, D. P., “Partial Conjugate Gradient Methods for a Class of Optimal Control 
Problems,” JEEE Transactions on Automatic Control, 1973, pages 209-217 


[B10] Bertsekas, D. P., “Multiplier Methods: A Survey,” Automatica 12, 2, 1976, pages 
133-145 

[B11] Bertsekas, D. P., Constrained Optimization and Lagrange Multiplier Methods, 
Academic Press, New York, 1982 

[B12] Bertsekas, D. P., Nonlinear Programming, Athena Scientific, Belmont, Mass., 1995 


[B13] Bertsimas, D. M., and Tsitsiklis, J. N., Linear Optimization. Athena Scientific, 
Belmont, Mass., 1997 


[B14] Biggs, M. C., “Constrained Minimization Using Recursive Quadratic Programming: 
Some Alternative Sub-Problem Formulations,” in Towards Global Optimization, 
L. C. W. Dixon and G. P. Szego (editors), North-Holland, Amsterdam, 1975 

[B15] Biggs, M. C., “On the Convergence of Some Constrained Minimization Algorithms 
Based on Recursive Quadratic Programming,” J. Inst. Math. Applics. 21, 1978, 
pages 67-81 

[B16] Birkhoff, G., “Three Observations on Linear Algebra,” Rev. Univ. Nac. Tucumdn, 
Ser. A., 5, 1946, pages 147-151 

[B17] Biswas, P., and Ye, Y., “Semidefinite Programming for Ad Hoc Wireless Sensor 
Network Localization,” Proc. of the 3rd IPSN 2004, pages 46-54 

[B18] Bixby, R. E., “Progress in Linear Programming,” ORSA J. on Comput. 6, 1, 1994, 
pages 15-22 

[B19] Bland, R. G., “New Finite Pivoting Rules for the Simplex Method,” Math. Oper. 
Res. 2, 2, 1977, pages 103-107 

[B20] Bland, R. G., Goldfarb, D., and Todd, M. J., “The Ellipsoidal Method: a Survey,” 
Oper. Res. 29, 1981, pages 1039-1091 

[B21] Blum, L., Cucker, F., Shub, M., and Smale, S., Complexity and Real Computation, 
Springer-Verlag, 1996 

[B22] Boyd, S., Ghaoui, L. E., Feron, E., and Balakrishnan, V., Linear Matrix Inequalities 
in System and Control Science, SIAM Publications. SIAM, Philadelphia, 1994 


[B23] Boyd, S., and Vandenberghe, L., Convex Optimization, Cambridge University Press, 
Cambridge, 2004 


Bibliography 529 


[B24] Broyden, C. G., “Quasi-Newton Methods and Their Application to Function 
Minimization,” Math. Comp. 21, 1967, pages 368-381 


[B25] Broyden, C. G., “The Convergence of a Class of Double Rank Minimization 
Algorithms: Parts I and II,” J. Inst. Maths. Appins. 6, 1970, pages 76-90 and 
222-231 


[B26] Butler, T., and Martin, A. V., “On a Method of Courant for Minimizing Functionals,” 
J. Math. and Phys. 41, 1962, pages 291-299 


[C1] Carroll, C. W., “The Created Response Surface Technique for Optimizing Nonlinear 
Restrained Systems,” Oper. Res. 9, 12, 1961, pages 169-184 


[C2] Charnes, A., “Optimality and Degeneracy in Linear Programming,” Econometrica 20, 
1952, pages 160-170 


[C3] Charnes, A., and Lemke, C. E., “The Bounded Variables Problem,’ ONR Research 
Memorandum 10, Graduate School of Industrial Administration, Carnegie Institute 
of Technology, Pittsburgh, Pa., 1954 


[C4] Cohen, A., “Rate of Convergence for Root Finding and Optimization Algorithms,” 
Ph.D. Dissertation, University of California, Berkeley, 1970 


[C5] Cook, S. A., “The Complexity of Theorem-Proving Procedures,” Proc. 3rd ACM 
Symposium on the Theory of Computing, 1971, pages 151-158 


[C6] Cottle, R. W., Linear Programming, Lecture Notes for MS&E 310, Stanford 
University, Stanford, 2002 


[C7] Cottle, R., Pang, J. S., and Stone, R. E., The Linear Complementarity Problem, 


Chapter 5.9 : Interior—Point Methods, Academic Press, Boston, 1992, pages 
461-475 


[C8] Courant, R., “Calculus of Variations and Supplementary Notes and Exercises” 
(mimeographed notes), supplementary notes by M. Kruskal and H. Rubin, revised 
and amended by J. Moser, New York University, 1962 


[C9] Crockett, J. B., and Chernoff, H., “Gradient Methods of Maximization,” Pacific J. 
Math. 5, 1955, pages 33-50 


[C10] Curry, H., “The Method of Steepest Descent for Nonlinear Minimization Problems,” 
Quart. Appl. Math. 2, 1944, pages 258-261 


[D1] Daniel, J. W., “The Conjugate Gradient Method for Linear and Nonlinear Operator 
Equations,” SIAM J. Numer. Anal. 4, 1, 1967, pages 10-26 


[D2] Dantzig, G. B., “Maximization of a Linear Function of Variables Subject to Linear 
Inequalities,” Chap. XXI of “Activity Analysis of Production and Allocation,” 
Cowles Commission Monograph 13, T. C. Koopmans (editor), John Wiley, New 
York, 1951 


[D3] Dantzig, G. B., “Application of the Simplex Method to a Transportation Problem,” 
in Activity Analysis of Production and Allocation, T. C. Koopmans (editor), John 
Wiley, New York, 1951, pages 359-373 


[D4] Dantzig, G. B., “Computational Algorithm of the Revised Simplex Method,” RAND 
Report RM-1266, The RAND Corporation, Santa Monica, Calif., 1953 


[D5] Dantzig, G. B., “Variables with Upper Bounds in Linear Programming,” RAND Report 
RM-1271, The RAND Corporation, Santa Monica, Calif., 1954 


[D6] Dantzig, G. B., Linear Programming and Extensions, Princeton University Press, 
Princeton, N.J., 1963 


[D7] Dantzig, G. B., Ford, L. R. Jr., and Fulkerson, D. R., “A Primal-Dual Algorithm,” 
Linear Inequalities and Related Systems, Annals of Mathematics Study 38, Princeton 
University Press, Princeton, N.J., 1956, pages 171-181 

[D8] Dantzig, G. B., Orden, A., and Wolfe, P., “Generalized Simplex Method for 
Minimizing a Linear Form under Linear Inequality Restraints,” RAND Report RM- 
1264, The RAND Corporation, Santa Monica, Calif., 1954 

[D9] Dantzig, G. B., and Thapa, M. N., Linear Programming 1: Introduction, Springer- 
Verlag, New York, 1997 

[D10] Dantzig, G. B., and Thapa, M. N., Linear Programming 2: Theory and Extensions, 
Springer-Verlag, New York, 2003 

[D11] Dantzig, G. B., and Wolfe, P., “Decomposition Principle for Linear Programs,” Oper. 
Res. 8, 1960, pages 101-111 

[D12] Davidon, W. C., “Variable Metric Method for Minimization,” Research and Devel- 
opment Report ANL-5990 (Ref.) U.S. Atomic Energy Commission, Argonne 
National Laboratories, 1959 

[D13] Davidon, W. C., “Variance Algorithm for Minimization,” Computer J. 10, 1968, pages 
406-410 

[D14] Dembo, R. S., Eisenstat, S. C., and Steinhaug, T., “Inexact Newton Methods,” SIAM 
J. Numer. Anal. 19, 2, 1982, pages 400-408 

[D15] Dennis, J. E., Jr., and Moré, J. J., “Quasi-Newton Methods, Motivation and Theory,” 
SIAM Review 19, 1977, pages 46-89 

[D16] Dennis, J. E., Jr., and Schnabel, R. E., “Least Change Secant Updates for Quasi-Newton 
Methods,” SIAM Review 21, 1979, pages 443-469 

[D17] Dixon, L. C. W., “Quasi-Newton Algorithms Generate Identical Points,” Math. Prog. 
2, 1972, pages 383-387 

[E1] Eaves, B. C., and Zangwill, W. I., “Generalized Cutting Plane Algorithms,” Working 
Paper No. 274, Center for Research in Management Science, University of 
California, Berkeley, July 1969 

[E2] Everett, H., III, “Generalized Lagrange Multiplier Method for Solving Problems of 
Optimum Allocation of Resources,” Oper. Res. 11, 1963, pages 399-417 

[F1] Faddeev, D. K., and Faddeeva, V. N., Computational Methods of Linear Algebra, 
W. H. Freeman, San Francisco, Calif., 1963 

[F2] Fang, S. C., and Puthenpura, S., Linear Optimization and Extensions, Prentice-Hall, 
Englewood Cliffs, N.J., 1994 

[F3] Fenchel, W., “Convex Cones, Sets, and Functions,” Lecture Notes, Dept. of Mathe- 
matics, Princeton University, Princeton, N.J., 1953 

[F4] Fiacco, A. V., and McCormick, G. P., Nonlinear Programming: Sequential Uncon- 
strained Minimization Techniques, John Wiley, New York, 1968 

[F5] Fiacco, A. V., and McCormick, G. P., Nonlinear Programming: Sequential Uncon- 
strained Minimization Techniques, John Wiley & Sons, New York, 1968. 
Reprint: Volume 4 of SIAM Classics in Applied Mathematics, SIAM Publications, 
Philadelphia, PA, 1990 

[F6] Fletcher, R., “A New Approach to Variable Metric Algorithms,” Computer J. 13, 3, 
1970, pages 317-322 


[F7] Fletcher, R., “An Exact Penalty Function for Nonlinear Programming with Inequal- 
ities,” Math. Prog. 5, 1973, pages 129-150 

[F8] Fletcher, R., “Conjugate Gradient Methods for Indefinite Systems,” Numerical 
Analysis Report, 11, Department of Mathematics, University of Dundee, Scotland, 
Sept. 1975 

[F9] Fletcher, R., Practical Methods of Optimization 1: Unconstrained Optimization, John 
Wiley, Chichester, 1980 

[F10] Fletcher, R., Practical Methods of Optimization 2: Constrained Optimization, John 
Wiley, Chichester, 1981 

[F11] Fletcher, R., and Powell, M. J. D., “A Rapidly Convergent Descent Method for 
Minimization,” Computer J. 6, 1963, pages 163-168 

[F12] Fletcher, R., and Reeves, C. M., “Function Minimization by Conjugate Gradients,” 
Computer J. 7, 1964, pages 149-154 

[F13] Ford, L. R. Jr., and Fulkerson, D. R., Flows in Networks, Princeton University Press, 
Princeton, New Jersey, 1962 

[F14] Forsythe, G. E., “On the Asymptotic Directions of the s-Dimensional Optimum 
Gradient Method,” Numerische Mathematik 11, 1968, pages 57-76 

[F15] Forsythe, G. E., and Moler, C. B., Computer Solution of Linear Algebraic Systems, 
Prentice-Hall, Englewood Cliffs, N.J., 1967 

[F16] Forsythe, G. E., and Wasow, W. R., Finite-Difference Methods for Partial Differential 
Equations, John Wiley, New York, 1960 


[F17] Fox, K., An Introduction to Numerical Linear Algebra, Clarendon Press, Oxford, 1964 


[F18] Freund, R. M., “Polynomial-time Algorithms for Linear Programming Based Only on 
Primal Scaling and Projected Gradients of a Potential Function,” Math. Prog., 51, 
1991, pages 203-222 

[F19] Frisch, K. R., “The Logarithmic Potential Method for Convex Programming,” 
Unpublished Manuscript, Institute of Economics, University of Oslo, Oslo, 
Norway, 1955 

[G1] Gabay, D., “Reduced Quasi-Newton Methods with Feasibility Improvement for 
Nonlinear Constrained Optimization,” Math. Prog. Studies 16, North-Holland, 
Amsterdam, 1982, pages 18-44 

[G2] Gale, D., The Theory of Linear Economic Models, McGraw-Hill, New York, 1960 


[G3] Garcia-Palomares, U. M., and Mangasarian, O. L., “Superlinearly Convergent Quasi- 
Newton Algorithms for Nonlinearly Constrained Optimization Problems,” Math. 
Prog. 11, 1976, pages 1-13 

[G4] Gass, S. I., Linear Programming, McGraw-Hill, third edition, New York, 1969 

[G5] Gill, P. E., Murray, W., Saunders, M. A., Tomlin, J. A. and Wright, M. H. (1986), 
“On projected Newton barrier methods for linear programming and an equivalence 
to Karmarkar’s projective method,” Math. Prog. 36, 183-209. 

[G6] Gill, P. E., and Murray, W., “Quasi-Newton Methods for Unconstrained Optimization,” 
J. Inst. Maths. Applics 9, 1972, pages 91-108 

[G7] Gill, P. E., Murray, W., and Wright, M. H., Practical Optimization, Academic Press, 
London, 1981 


[G8] Goemans, M. X., and Williamson, D. P., “Improved Approximation Algorithms for 
Maximum Cut and Satisfiability Problems using Semidefinite Programming,” J. 
Assoc. Comput. Mach. 42, 1995, pages 1115-1145 

[G9] Goldfarb, D., “A Family of Variable Metric Methods Derived by Variational Means,” 
Maths. Comput. 24, 1970, pages 23-26 

[G10] Goldfarb, D., and Todd, M. J., “Linear Programming,” in Optimization, volume 1 of 
Handbooks in Operations Research and Management Science, G. L. Nemhauser, 
A. H. G. Rinnooy Kan, and M. J. Todd, (editors), North Holland, Amsterdam, The 
Netherlands, 1989, pages 141-170 

[G11] Goldfarb, D., and Xiao, D., “A Primal Projective Interior Point Method for Linear 
Programming,” Math. Prog., 51, 1991, pages 17-43 

[G12] Goldstein, A. A., “On Steepest Descent,” SIAM J. on Control 3, 1965, pages 147-151 

[G13] Gonzaga, C. C., “An Algorithm for Solving Linear Programming Problems in O(n³L) 
Operations,” in Progress in Mathematical Programming: Interior Point and 
Related Methods, N. Megiddo, (editor), Springer Verlag, New York, 1989, pages 
1-28 


[G14] Gonzaga, C. C., and Todd, M. J., “An O(√n L)-Iteration Large-step Primal-dual 
Affine Algorithm for Linear Programming,” SIAM J. Optimization 2, 1992, pages 
349-359 


[G15] Greenstadt, J., “Variations on Variable Metric Methods,” Maths. Comput. 24, 1970, 
pages 1-22 

[H1] Hadley, G., Linear Programming, Addison-Wesley, Reading, Mass., 1962 

[H2] Hadley, G., Nonlinear and Dynamic Programming, Addison-Wesley, Reading, Mass., 
1964 

[H3] Han, S. P., “A Globally Convergent Method for Nonlinear Programming,” J. Opt. 
Theo. Appl. 22, 3, 1977, pages 297-309 

[H4] Hancock, H., Theory of Maxima and Minima, Ginn, Boston, 1917 

[H5] Hartmanis, J., and Stearns, R. E., “On the Computational Complexity of Algorithms,” 
Trans. A.M.S 117, 1965, pages 285-306 


[H6] den Hertog, D., Interior Point Approach to Linear, Quadratic and Convex 
Programming, Algorithms and Complexity, Ph.D. Thesis, Faculty of Mathematics 
and Informatics, TU Delft, NL-2628 BL Delft, The Netherlands, 1992 


[H7] Hestenes, M. R., “The Conjugate Gradient Method for Solving Linear Systems,” Proc. 
of Symposium in Applied Math. VI, Num. Anal., 1956, pages 83-102 

[H8] Hestenes, M. R., “Multiplier and Gradient Methods,” J. Opt. Theo. Appl. 4, 5, 1969, 
pages 303-320 

[H9] Hestenes, M. R., Conjugate-Direction Methods in Optimization, Springer-Verlag, 
Berlin, 1980 


[H10] Hestenes, M. R., and Stiefel, E. L., “Methods of Conjugate Gradients for Solving 
Linear Systems,” J. Res. Nat. Bur. Standards, Section B, 49, 1952, pages 
409-436 


[H11] Hitchcock, F. L., “The Distribution of a Product from Several Sources to Numerous 
Localities,” J. Math. Phys. 20, 1941, pages 224-230 


[H12] Huard, P., “Resolution of Mathematical Programming with Nonlinear Constraints by 
the Method of Centers,” in Nonlinear Programming, J. Abadie, (editor), North 
Holland, Amsterdam, The Netherlands, 1967, pages 207-219 


[H13] Huang, H. Y., “Unified Approach to Quadratically Convergent Algorithms for Function 
Minimization,” J. Opt. Theo. Appl. 5, 1970, pages 405-423 


[H14] Hurwicz, L., “Programming in Linear Spaces,” in Studies in Linear and Nonlinear 
Programming, K. J. Arrow, L. Hurwicz, and H. Uzawa, (editors), Stanford 
University Press, Stanford, Calif., 1958 


[I1] Isaacson, E., and Keller, H. B., Analysis of Numerical Methods, John Wiley, New 
York, 1966 


[J1] Jacobs, W., “The Caterer Problem,” Naval Res. Logist. Quart. 1, 1954, pages 154-165 

[J2] Jarre, F., “Interior-point methods for convex programming,” Applied Mathematics & 
Optimization, 26:287-311, 1992. 

[K1] Karlin, S., Mathematical Methods and Theory in Games, Programming, and 
Economics, Vol. 1, Addison-Wesley, Reading, Mass., 1959 


[K2] Karmarkar, N. K., “A New Polynomial-time Algorithm for Linear Programming,” 
Combinatorica 4, 1984, pages 373-395 

[K3] Kelley, J. E., “The Cutting-Plane Method for Solving Convex Programs,” J. Soc. 
Indus. Appl. Math. VIII, 4, 1960, pages 703-712 

[K4] Khachiyan, L. G., “A Polynomial Algorithm for Linear Programming,” Doklady Akad. 
Nauk USSR 244, 1979, pages 1093-1096, Translated in Soviet Math. 
Doklady 20, 1979, pages 191-194 

[K5] Klee, V., and Minty, G. J., “How Good is the Simplex Method,” in Inequalities III, 
O. Shisha, (editor), Academic Press, New York, N.Y., 1972 

[K6] Kojima, M., Mizuno, S., and Yoshise, A., “A Polynomial-time Algorithm for a Class 
of Linear Complementarity Problems,” Math. Prog. 44, 1989, pages 1-26 

[K7] Kojima, M., Mizuno, S., and Yoshise, A., “An O(√n L) Iteration Potential Reduction 
Algorithm for Linear Complementarity Problems,” Math. Prog. 50, 1991, pages 
331-342 

[K8] Koopmans, T. C., “Optimum Utilization of the Transportation System,” Proceedings 
of the International Statistical Conference, Washington, D.C., 1947 

[K9] Kowalik, J., and Osborne, M. R., Methods for Unconstrained Optimization Problems, 
Elsevier, New York, 1968 

[K10] Kuhn, H. W., “The Hungarian Method for the Assignment Problem,” Naval Res. 
Logist. Quart. 2, 1955, pages 83-97 

[K11] Kuhn, H. W., and Tucker, A. W., “Nonlinear Programming,” in Proceedings of the 
Second Berkeley Symposium on Mathematical Statistics and Probability, J. Neyman 
(editor), University of California Press, Berkeley and Los Angeles, Calif., 1951, 
pages 481-492 

[L1] Lanczos, C., Applied Analysis, Prentice-Hall, Englewood Cliffs, N.J., 1956 


[L2] Lawler, E., Combinatorial Optimization: Networks and Matroids, Holt, Rinehart, and 
Winston, New York, 1976 


[L3] Lemarechal, C. and Mifflin, R. Nonsmooth Optimization, IIASA Proceedings II, 
Pergamon Press, Oxford, 1978 


[L4] Lemke, C. E., “The Dual Method of Solving the Linear Programming Problem,” Naval 
Research Logistics Quarterly 1, 1, 1954, pages 36-47 


[L5] Levitin, E. S., and Polyak, B. T., “Constrained Minimization Methods,” Zh. vychisl. 
Mat. mat. Fiz. 6, 5, 1966, pages 787-823 


[L6] Loewner, C., “Über monotone Matrixfunktionen,” Math. Zeit. 38, 1934, pages 
177-216. Also see C. Loewner, “Advanced matrix theory,” mimeo notes, Stanford 
University, 1957 


[L7] Lootsma, F. A., Boundary Properties of Penalty Functions for Constrained 
Minimization, Doctoral Dissertation, Technical University, Eindhoven, The Nether- 
lands, May 1970 


[L8] Luenberger, D. G., Optimization by Vector Space Methods, John Wiley, New York, 
1969 


[L9] Luenberger, D. G., “Hyperbolic Pairs in the Method of Conjugate Gradients,” SIAM 
J. Appl. Math. 17, 6, 1969, pages 1263-1267 


[L10] Luenberger, D. G., “A Combined Penalty Function and Gradient Projection Method for 
Nonlinear Programming,” Internal Memo, Dept. of Engineering-Economic Systems, 
Stanford University, June 1970 


[L11] Luenberger, D. G., “The Conjugate Residual Method for Constrained Minimization 
Problems,” SIAM J. Numer. Anal. 7, 3, 1970, pages 390-398 


[L12] Luenberger, D. G., “Control Problems with Kinks,” IEEE Trans. on Aut. Control 
AC-15, 5, 1970, pages 570-575 


[L13] Luenberger, D. G., “Convergence Rate of a Penalty-Function Scheme,” J. Optimization 
Theory and Applications 7, 1, 1971, pages 39-51 


[L14] Luenberger, D. G., “The Gradient Projection Method Along Geodesics,” Management 
Science 18, 11, 1972, pages 620-631 


[L15] Luenberger, D. G., Introduction to Linear and Nonlinear Programming, First Edition, 
Addison-Wesley, Reading, Mass., 1973 


[L16] Luenberger, D. G., Linear and Nonlinear Programming, 2nd edition, Addison-Wesley, 
Reading, 1984 


[L17] Luenberger, D. G., “An Approach to Nonlinear Programming,” J. Optimi. Theo. Appl. 
11, 3, 1973, pages 219-227 


[L18] Luo, Z. Q., Sturm, J., and Zhang, S., “Conic Convex Programming and Self-dual 
Embedding,” Optimization Methods and Software, 14, 2000, pages 169-218 


[L19] Lustig, I. J., Marsten, R. E., and Shanno, D. F., “On Implementing Mehrotra’s 
Predictor-corrector Interior Point Method for Linear Programming,” SIAM J. 
Optimization 2, 1992, pages 435-449 


[M1] Maratos, N., “Exact Penalty Function Algorithms for Finite Dimensional and Control 
Optimization Problems,” Ph.D. Thesis, Imperial College Sci. Tech., Univ. of 
London, 1978 


[M2] McCormick, G. P., “Optimality Criteria in Nonlinear Programming,” Nonlinear 
Programming, SIAM-AMS Proceedings, IX, 1976, pages 27-38 


[M3] McLinden, L., “The Analogue of Moreau’s Proximation Theorem, with Applications 
to the Nonlinear Complementarity Problem,” Pacific J. Math. 88, 1980, pages 
101-161 


[M4] Megiddo, N., “Pathways to the Optimal Set in Linear Programming,” in Progress in 
Mathematical Programming : Interior Point and Related Methods, N. Megiddo, 
(editor), Springer Verlag, New York, 1989, pages 131-158 


[M5] Mehrotra, S., “On the Implementation of a Primal-dual Interior Point Method,” SIAM 
J. Optimization 2, 4, 1992, pages 575-601 


[M6] Mizuno, S., Todd, M. J., and Ye, Y., “On Adaptive Step Primal-dual Interior-point 
Algorithms for Linear Programming,” Math. Oper. Res. 18, 1993, pages 
964-981 


[M7] Monteiro, R. D. C., and Adler, I., “Interior Path Following Primal-dual Algorithms: 
Part I: Linear Programming,” Math. Prog. 44, 1989, pages 27-41 

[M8] Morrison, D. D., “Optimization by Least Squares,” SIAM J. Numer. Anal. 5, 1968, 
pages 83-88 

[M9] Murtagh, B. A., Advanced Linear Programming, McGraw-Hill, New York, 1981 

[M10] Murtagh, B. A., and Sargent, R. W. H., “A Constrained Minimization Method with 
Quadratic Convergence,” Chapter 14 in Optimization, R. Fletcher (editor), Academic 
Press, London, 1969 


[M11] Murty, K. G., Linear and Combinatorial Programming, John Wiley, New York, 1976 

[M12] Murty, K. G., Linear Complementarity, Linear and Nonlinear Programming, volume 3 
of Sigma Series in Applied Mathematics, chapter 11.4.1: The Karmarkar’s 
Algorithm for Linear Programming, Heldermann Verlag, Nassauische Str. 26, 
D-1000 Berlin 31, Germany, 1988, pages 469-494 

[N1] Nash, S. G., and Sofer, A., Linear and Nonlinear Programming, New York, 
McGraw-Hill Companies, Inc., 1996 

[N2] Nesterov, Y., and Nemirovskii, A., Interior Point Polynomial Methods in Convex 
Programming: Theory and Algorithms, SIAM Publications, SIAM, Philadelphia, 
1994 

[N3] Nesterov, Y. and Todd, M. J., “Self-scaled Barriers and Interior-point Methods for 
Convex Programming,” Math. Oper. Res. 22, 1, 1997, pages 1-42 


[N4] Nesterov, Y., Introductory Lectures on Convex Optimization: A Basic Course, Kluwer, 
Boston, 2004 


[O1] Orchard-Hays, W., “Background Development and Extensions of the Revised Simplex 
Method,” RAND Report RM-1433, The RAND Corporation, Santa Monica, Calif., 
1954 


[O2] Orden, A., Application of the Simplex Method to a Variety of Matrix Problems, 
pages 28-50 of Directorate of Management Analysis: “Symposium on Linear 
Inequalities and Programming,” A. Orden and L. Goldstein (editors), DCS/ 
Comptroller, Headquarters, U.S. Air Force, Washington, D.C., April 1952 

[O3] Orden, A., “The Transshipment Problem,” Management Science 2, 3, April 1956, 
pages 276-285 

[O4] Oren, S. S., “Self-Scaling Variable Metric (SSVM) Algorithms II: Implementation and 
Experiments,” Management Science 20, 1974, pages 863-874 

[O5] Oren, S. S., and Luenberger, D. G., “Self-Scaling Variable Metric (SSVM) Algorithms I: 
Criteria and Sufficient Conditions for Scaling a Class of Algorithms,” Management 
Science 20, 1974, pages 845-862 


[O6] Oren, S. S., and Spedicato, E., “Optimal Conditioning of Self-Scaling Variable Metric 
Algorithms,” Math. Prog. 10, 1976, pages 70-90 

[O7] Ortega, J. M., and Rheinboldt, W. C., Iterative Solution of Nonlinear Equations in 
Several Variables, Academic Press, New York, 1970 

[P1] Paige, C. C., and Saunders, M. A., “Solution of Sparse Indefinite Systems of Linear 
Equations,” SIAM J. Numer. Anal. 12, 4, 1975, pages 617-629 

[P2] Papadimitriou, C., and Steiglitz, K., Combinatorial Optimization Algorithms and 
Complexity, Prentice-Hall, Englewood Cliffs, N.J., 1982 

[P3] Perry, A., “A Modified Conjugate Gradient Algorithm,” Discussion Paper No. 229, 
Center for Mathematical Studies in Economics and Management Science, North- 
western University, Evanston, Illinois, 1976 

[P4] Polak, E., Computational Methods in Optimization: A Unified Approach, Academic 
Press, New York, 1971 

[P5] Polak, E., and Ribiere, G., “Note sur la Convergence de Méthodes de Directions 
Conjuguées,” Revue Francaise Informat. Recherche Operationnelle 16, 1969, pages 
35-43 

[P6] Powell, M. J. D., “An Efficient Method for Finding the Minimum of a Function of 
Several Variables without Calculating Derivatives,” Computer J. 7, 1964, pages 
155-162 

[P7] Powell, M. J. D., “A Method for Nonlinear Constraints in Minimization Problems,” 
in Optimization, R. Fletcher (editor), Academic Press, London, 1969, pages 
283-298 

[P8] Powell, M. J. D., “On the Convergence of the Variable Metric Algorithm,” 
Mathematics Branch, Atomic Energy Research Establishment, Harwell, Berkshire, 
England, October 1969 

[P9] Powell, M. J. D., “Algorithms for Nonlinear Constraints that Use Lagrangian 
Functions,” Math. Prog. 14, 1978, pages 224-248 


[P10] Pshenichny, B. N., and Danilin, Y. M., Numerical Methods in Extremal Problems 
(translated from Russian by V. Zhitomirsky), MIR Publishers, Moscow, 1978 

[R1] Renegar, J., “A Polynomial-time Algorithm, Based on Newton’s Method, for Linear 
Programming,” Math. Prog., 40, 1988, pages 59-93 

[R2] Renegar, J., A Mathematical View of Interior-Point Methods in Convex Optimization, 
Society for Industrial and Applied Mathematics, Philadelphia, 2001 

[R3] Rockafellar, R. T., “The Multiplier Method of Hestenes and Powell Applied to Convex 
Programming,” J. Opt. Theory and Appl. 12, 1973, pages 555-562 

[R4] Roos, C., Terlaky, T., and Vial, J.-Ph., Theory and Algorithms for Linear Optimization: 
An Interior Point Approach, John Wiley & Sons, Chichester, 1997 

[R5] Rosen, J., “The Gradient Projection Method for Nonlinear Programming, I. Linear 
Constraints,” J. Soc. Indust. Appl. Math. 8, 1960, pages 181-217 

[R6] Rosen, J., “The Gradient Projection Method for Nonlinear Programming, II. Non- 
Linear Constraints,” J. Soc. Indust. Appl. Math. 9, 1961, pages 514-532 

[S1] Saigal, R., Linear Programming: Modern Integrated Analysis, Kluwer Academic 
Publisher, Boston, 1995 


[S2] Shah, B., Buehler, R., and Kempthorne, O., “Some Algorithms for Minimizing 
a Function of Several Variables,” J. Soc. Indust. Appl. Math. 12, 1964, pages 
74-92 


[S3] Shanno, D. F., “Conditioning of Quasi-Newton Methods for Function Minimization,” 
Maths. Comput. 24, 1970, pages 647-656 


[S4] Shanno, D. F., “Conjugate Gradient Methods with Inexact Line Searches,” Math. 
Oper. Res. 3, 3, Aug. 1978, pages 244-256 


[S5] Shefi, A., “Reduction of Linear Inequality Constraints and Determination of All 
Feasible Extreme Points,” Ph.D. Dissertation, Department of Engineering-Economic 
Systems, Stanford University, Stanford, Calif., October 1969 


[S6] Simonnard, M., Linear Programming, translated by William S. Jewell, Prentice-Hall, 
Englewood Cliffs, N.J., 1966 


[S7] Slater, M., “Lagrange Multipliers Revisited: A Contribution to Non-Linear 
Programming,” Cowles Commission Discussion Paper, Math 403, Nov. 1950 


[S8] Sonnevend, G., “An ‘Analytic Center’ for Polyhedrons and New Classes of Global 
Algorithms for Linear (Smooth, Convex) Programming,” in System Modelling and 
Optimization: Proceedings of the 12th IFIP-Conference held in Budapest, Hungary, 
September 1985, volume 84 of Lecture Notes in Control and Information Sciences, 
A. Prekopa, J. Szelezsan, and B. Strazicky, (editors), Springer Verlag, Berlin, 
Germany, 1986, pages 866-876 


[S9] Stewart, G. W., “A Modification of Davidon’s Minimization Method to Accept 
Difference Approximations of Derivatives,” J.A.C.M. 14, 1967, pages 72-83 


[S10] Stiefel, E. L., “Kernel Polynomials in Linear Algebra and Their Numerical Applica- 
tions,” Nat. Bur. Standards, Appl. Math. Ser., 49, 1958, pages 1-22 


[S11] Sturm, J. F., “Using SeDuMi 1.02, a MATLAB toolbox for optimization over 
symmetric cones,” Optimization Methods and Software, 11&12, 1999, pages 
625-633 


[S12] Sun, J. and Qi, L., “An interior point algorithm of O(√n |ln(ε)|) iterations for 
C¹-convex programming,” Math. Programming, 57, 1992, pages 239-257 


[T1] Tamir, A., “Line Search Techniques Based on Interpolating Polynomials Using 
Function Values Only,” Management Science 22, 5, Jan. 1976, pages 576-586 


[T2] Tanabe, K., “Complementarity-enforced Centered Newton Method for Mathematical 
Programming,” in New Methods for Linear Programming, K. Tone, (editor), The 
Institute of Statistical Mathematics, 4-6-7 Minami Azabu, Minatoku, Tokyo 106, 
Japan, 1987, pages 118-144 


[T3] Tapia, R. A., “Quasi-Newton Methods for Equality Constrained Optimization: Equiv- 
alents of Existing Methods and New Implementation,” Symposium on Nonlinear 
Programming III, O. Mangasarian, R. Meyer, and S. Robinson (editors), Academic 
Press, New York, 1978, pages 125-164 


[T4] Todd, M. J., “A Low Complexity Interior Point Algorithm for Linear Programming,” 
SIAM J. Optimization 2, 1992, pages 198-209 

[T5] Todd, M. J., and Ye, Y., “A Centered Projective Algorithm for Linear Programming,” 
Math. Oper. Res., 15, 1990, pages 508-529 


[T6] Tone, K., “Revisions of Constraint Approximations in the Successive QP Method for 
Nonlinear Programming Problems,” Math. Prog. 26, 2, June 1983, pages 144-152 


[T7] Topkis, D. M., “A Note on Cutting-Plane Methods Without Nested Constraint Sets,” 
ORC 69-36, Operations Research Center, College of Engineering, Berkeley, Calif., 
December 1969 


[T8] Topkis, D. M., and Veinott, A. F., Jr., “On the Convergence of Some Feasible Direction 
Algorithms for Nonlinear Programming,” J. SIAM Control 5, 2, 1967, pages 268-279 


[T9] Traub, J. F., Iterative Methods for the Solution of Equations, Prentice-Hall, Englewood 
Cliffs, N.J., 1964 


[T10] Tunçel, L., “Constant Potential Primal-dual Algorithms: A Framework,” Math. Prog. 
66, 1994, pages 145-159 


[T11] Tutuncu, R., “An Infeasible-interior-point Potential-reduction Algorithm for Linear 
Programming,” Ph.D. Thesis, School of Operations Research and Industrial 
Engineering, Cornell University, Ithaca, N.Y., 1995 


[T12] P. Tseng, “Complexity analysis of a linear complementarity algorithm based on a 
Lyapunov function,” Math. Prog., 53:297-306, 1992. 

[V1] Vaidya, P. M., “An Algorithm for Linear Programming Which Requires 
O((m+n)n² + (m+n)^1.5 nL) Arithmetic Operations,” Math. Prog., 47, 1990, 
pages 175-201, Condensed version in : Proceedings of the 19th Annual ACM 
Symposium on Theory of Computing, 1987, pages 29-38 

[V2] Vandenberghe, L., and Boyd, S., “Semidefinite Programming,” SIAM Review, 38, 1, 
1996, pages 49-95 

[V3] Vanderbei, R. J., Linear Programming: Foundations and Extensions, Kluwer 
Academic Publishers, Boston, 1997 

[V4] Vavasis, S. A., Nonlinear Optimization: Complexity Issues, Oxford Science, New 
York, N.Y., 1991 

[V5] Veinott, A. F., Jr., “The Supporting Hyperplane Method for Unimodal Programming,” 
Operations Research XV, 1, 1967, pages 147-152 

[V6] Vorobyev, Y. V., Methods of Moments in Applied Mathematics, Gordon and Breach, 
New York, 1965 

[W1] Wilde, D. J., and Beightler, C. S., Foundations of Optimization, Prentice-Hall, 
Englewood Cliffs, N.J., 1967 

[W2] Wilson, R. B., “A Simplicial Algorithm for Concave Programming,” Ph.D. Disser- 
tation, Harvard University Graduate School of Business Administration, 1963 

[W3] Wolfe, P., “A Duality Theorem for Nonlinear Programming,” Quar. Appl. Math. 19, 
1961, pages 239-244 

[W4] Wolfe, P., “On the Convergence of Gradient Methods Under Constraints,” IBM 
Research Report RZ 204, Zurich, Switzerland, 1966 

[W5] Wolfe, P., “Methods of Nonlinear Programming,” Chapter 6 of Nonlinear 
Programming, J. Abadie (editor), Interscience, John Wiley, New York, 1967, pages 
97-131 

[W6] Wolfe, P., “Convergence Conditions for Ascent Methods,” SIAM Review 11, 1969, 
pages 226-235 

[W7] Wolfe, P., “Convergence Theory in Nonlinear Programming,” Chapter 1 in Integer and 
Nonlinear Programming, J. Abadie (editor), North-Holland Publishing Company, 
Amsterdam, 1970 


[W8] Wright, S. J., Primal-Dual Interior-Point Methods, SIAM, Philadelphia, 1996 


[Y1] Ye, Y., “An O(n³L) Potential Reduction Algorithm for Linear Programming,” Math. 
Prog. 50, 1991, pages 239-258 

[Y2] Ye, Y., Todd, M. J., and Mizuno, S., “An O(√n L)-Iteration Homogeneous and Self-dual 
Linear Programming Algorithm,” Math. Oper. Res. 19, 1994, pages 53-67 

[Y3] Ye, Y., Interior Point Algorithms, Wiley, New York, 1997 

[Z1] Zangwill, W. I., “Nonlinear Programming via Penalty Functions,” Management 
Science 13, 5, 1967, pages 344-358 

[Z2] Zangwill, W. I., Nonlinear Programming: A Unified Approach, Prentice-Hall, 
Englewood Cliffs, N.J., 1969 

[Z3] Zhang, Y., and Zhang, D., “On Polynomiality of the Mehrotra-type Predictor-corrector 
Interior-point Algorithms,” Math. Prog., 68, 1995, pages 303-317 

[Z4] Zoutendijk, G., Methods of Feasible Directions, Elsevier, Amsterdam, 1960 


INDEX


Absolute-value penalty function, 426, 428-429, 454, 481-482, 483, 499
Active constraints, 87, 322-323
Active set methods, 363-364, 366-367, 396, 469, 472, 484, 498-499
    convergence properties of, 364
Active set theorem, 366
Activity space, 41, 87
Adjacent extreme points, 38-42
Aitken δ² method, 261
Aitken double sweep method, 253
Algorithms
    interior, 111-140, 250-252, 373, 406-407, 487-499
    iterative, 6-7, 112, 183, 201, 210-212
    line search, 215-233, 240-242, 247
    maximal flow, 170, 172-174, 178
    path-following, 131
    polynomial time, 7, 112, 114
    simplex, 33-70, 114-115
    transportation, 145, 156, 160
Analytic center, 112, 118-120, 122-125, 126, 139
Arcs, 160-170, 172-173, 175
    artificial, 166
    basic, 165
    See also Nodes
Armijo rule, 230, 232-233, 240
Artificial variables, 50-53, 57
Assignment problem, 159-160, 162, 175
Associated restricted dual, 94-97
Associated restricted primal, 93-97
Asymptotic convergence, 241, 279, 364, 375
Augmented Lagrangian methods, 451-456, 458-459
Augmenting path, 170, 172-173
Average convergence ratio, 210-211
Average rates, 210


Back substitution, 151-154, 524
Backtracking, 233, 248-252
Barrier functions, 112, 248, 250-251, 257, 405-407, 412-414
Barrier methods, 127-139, 401, 405-408, 412-414, 469, 472
    See also Penalty methods
Barrier problem, 121, 125, 128, 130, 402, 418, 429, 488
Basic feasible solution, 20-25
Basic variables, 19-20
Basis Triangularity theorem, 151-153, 159
Big-M method, 74, 108, 136, 139
Bland’s rule, 49
Block-angular structure, 62
Bordered Hessian Test, 339
Broyden family, 293-295, 302, 305
Broyden-Fletcher-Goldfarb-Shanno update, 294
Broyden method, 294-297, 300, 302, 305
Bug, 374-375, 387-388


Canonical form, 34-38
Canonical rate, 7, 376, 402, 417, 420-423, 425, 429, 447, 467, 485-486, 499
Capacitated networks, 168
Caterer problem, 177
Cauchy-Schwarz inequality, 291
Central path, 121-139, 414
    dual, 124
    primal-dual, 125-126, 489
Chain, 161
    hanging, 330-331, 391-394, 449-451
Cholesky factorization, 250
Closed mappings, 203
Closed set, 511
Combinatorial auction, 17
Compact set, 512
Complementary formula, 293
Complementary slackness, 88-90, 93, 95, 127, 137, 174, 342, 351, 440, 470, 483
Complexity theory, 6-7, 112, 114, 130, 133, 136, 212, 491, 498
Concave functions, 183, 192
Condition number, 239
Conjugate direction method, 263, 283
    algorithm, 263-268
    descent properties of, 266
    theorem, 265
Conjugate gradient method, 268-283, 290, 293, 296-297, 300, 304-306, 312, 394,
        418-419, 458, 475
    algorithm, 269-270, 277, 419
    non-quadratic, 277-279
    partial, 273-276, 420-421
    PARTAN, 279-281
    theorem, 270-271, 273, 278
Conjugate residual method, 501
Connected graph, 161
Constrained problems, 2, 4
Constraints
    active, 322, 342, 344
    inactive, 322, 363
    inequality, 11
    nonnegativity, 15
    quadratic, 495-496
    redundant, 98, 119
Consumer surplus, 332
Control example, 189
Convergence
    analysis, 6-7, 212, 226, 236, 245, 251-252, 254, 274, 279, 287, 313, 395, 484-485
    average order of, 208, 210
    canonical rate of, 7, 376, 402, 417, 420-425, 447, 485-486
    of descent algorithms, 201-208
    dual canonical rate of, 447, 446-447
    of ellipsoid method, 117-118
    geometric, 209
    global theorem, 206
    linear, 209
    of Newton’s method, 246-247
    order, 208
    of partial C-G, 273, 275
    of penalty and barrier methods, 403-405, 407, 418-419, 420-425
    of quasi-Newton, 292, 296-299
    rate, 209
    ratio, 209
    speed of, 208
    of steepest descent, 238-241
    superlinear, 209
    theory of, 208, 392
    of vectors, 211
Convex cone, 516-517
Convex duality, 435-452
Convex functions, 192-197
Convex hull, 516, 522
Convex polyhedron, 24, 111, 522
Convex polytope, 519
Convex programming problem, 346
Convex Sets, 515
    theory of, 22-23
Convex simplex method, 395-396
Coordinate descent, 253-255
Cubic fit, 224-225
Curve fitting, 219-233
Cutting plane methods, 435, 460
Cycle, in linear programming, 49, 72, 76-77
Cyclic coordinate descent, 253


Damping, 247-248, 250-252
Dantzig-Wolfe decomposition method, 62-70, 77
Data analysis procedures, 333
Davidon-Fletcher-Powell method, 290-292, 294-295, 298-302, 304
Decision problems, 333
Decomposition, 448-451
    LU, 59-62, 523-526
    LP, 62-70, 77
Deflected gradients, 286
Degeneracy, 39, 49, 158
Descent
    algorithms, 201
    function, 203
DFP, see Davidon-Fletcher-Powell method
Diet problem, 14-15, 45, 81, 89-90, 92
    dual of, 81
Differentiable convex functions, Properties of, 194
Directed graphs, 161
Dual canonical convergence rate, 435, 446-447
Dual central path, 123, 125
Dual feasible, 87, 91
Dual function defined, 437, 443, 446, 458
Duality, 79-98, 121, 126-127, 173, 435, 441, 462, 494-497
    asymmetric form of, 80
    gap, 126-127, 130-131, 436, 438, 440-441, 497
    local, 441-446
    theorem, 82-85, 90, 173, 327, 338, 444
Dual linear program, 79, 494
Dual simplex method, 90-93, 127


Economic interpretation 
of Dantzig-Wolfe decomposition, 66
of decomposition, 449
of primal-dual algorithm, 95-96
of relative cost coefficients, 45-46 
of simplex multipliers, 94 
Eigenvalues, 116-117, 237-239, 241-246, 
248-250, 257, 272-276, 279, 286-287, 
293, 297, 300-303, 306, 311, 378-379, 
388-390, 392-394, 410-412, 416-418, 
420-423, 446-447, 457-458, 486-487, 
510-511 
in tangent space, 335-339, 374-376 
interlocking, 300, 302, 393 
steepest descent rate, 237-239
Eigenvector, 510 
Ellipsoid method, 112, 115-118, 139 
Entropy, 329 
Epigraph, 199, 348, 351, 354, 437 
Error 
function, 211, 238 
tolerance, 372 
Exact penalty theorem, 427 




Expanding subspace theorem, 266-268, 
270, 272, 281 

Exponential density, 330 

Extreme point, 23, 521 


False position method, 221-222

Feasible direction methods, 360 

Feasible solution, 20-22 

Fibonacci search method, 216-219, 224 

First-order necessary conditions, 184-185, 
197, 326-342 

Fletcher-Reeves method, 278, 281 

Free variables, 13, 53 

Full rank assumption, 19 


Game theory, 105 
Gaussian elimination, 34, 60-61, 150-151, 
212, 523, 526 
Gauss-Southwell method, 253-255, 395 
Generalized reduced gradient method, 385 
Geodesics, 374-380 
Geometric convergence, 209 
Global convergence, 201-208, 226-228,
253, 279, 296, 385, 404, 466, 470, 
478, 483 
theorem, 206 
Global duality, 435-441 
Global minimum points, 184 
Golden section ratio, 217-219 
Goldstein test, 232-233
Gradient projection method, 367-373, 
421-423, 425 
convergence rate of the, 374-382 
Graph, 204 


Half space, 518-519 

Hanging chain, 330-331, 332, 390-394, 
449-451 

Hessian of dual, 443 

Hessian of the Lagrangian, 335-345, 
359, 374, 376, 389, 392, 396, 
411-412, 429, 442-445, 450, 456, 
472-474, 476, 480-481 

Hessian matrix, 190-191, 196, 241, 
288-293, 321, 339, 344, 376, 390, 
410-414, 420, 458, 473, 495 

Homogeneous self-dual algorithm, 136 

Hyperplanes, 517-521 




Implicit function theorem, 325, 341, 347, 
513-514 

Inaccurate line search, 230-233 

Incidence matrix, 161 

Initialization, 134-135 

Interior-point methods, 111-139, 406, 469, 
487-491 

Interlocking Eigenvalues Lemma, 243, 
300-302, 393 

Iterative algorithm, 6-7, 112, 183, 201, 210, 
212, 256 


Jacobian matrix, 325, 340, 488, 514 
Jamming, 362-363 


Kantorovich inequality, 237-238 
Karush-Kuhn-Tucker conditions, 5, 
342, 369 
Kelley’s convex cutting plane algorithm, 
463-465 
Khachiyan’s ellipsoid method, 112, 115 


Lagrange multiplier, 5, 121-122, 327, 340, 
344, 345, 346-353, 444-446, 452, 456, 
460, 483-484, 517 

Levenberg-Marquardt type methods, 250 

Limit superior, 512 

Line search, 132, 215-233, 240-241, 248, 
253-254, 278-279, 291, 302, 304-305, 
312, 360, 362, 395, 466 

Linear convergence, 209 

Linear programming, 2, 11-175, 338, 493

examples of, 14-19 
fundamental theorem of, 20-22 

Linear variety, 517 

Local convergence, 6, 226, 235, 254, 297 

Local duality, 441, 451, 456 

Local minimum point, 184 

Logarithmic barrier method, 119, 250-251, 
406, 418, 488 

LU decomposition, 59-60, 62, 523-526 


Manufacturing problem, 16, 95 
Mappings, 201-205 

Marginal price, 89-90, 340 
Marriage problem, 177 

Master problem, 63-67 

Matrix notation, 508 


Max Flow-Min Cut theorem, 172-173 
Maximal flow, 145, 168-175 

Mean value theorem, 513 
Memoryless quasi-Newton method, 
304-306 

Minimum point, 184 

Morrison’s method, 431 

Multiplier methods, 451, 458 


Newton’s method, 121, 128, 131, 140, 215, 
219-222, 246-254, 257, 285-287, 
306-312, 416, 422-425, 429, 463-464, 
476-481, 484, 486, 499 

modified, 285-287, 306, 417, 429, 479,
481, 483, 486 

Node-arc incidence matrix, 161-162

Nodes, 160 

Nondegeneracy, 20, 39 

Nonextremal variable, 100-102 

Normalizing constraint, 136-137 

Northwest corner rule, 148-150, 152, 155,
157-158

Null value theorem, 100 

Null variables, 99-100 


Oil refinery problem, 28 

Optimal control, 5, 322, 355, 391 
Optimal feasible solution, 20-22, 33 
Order of convergence, 209-211 
Orthogonal complement, 422, 510 
Orthogonal matrix, 51 


Parallel tangents method, see PARTAN 
Parimutuel auction, 17 
PARTAN, 279-281, 394 
advantages and disadvantages of, 281 
theorem, 280 
Partial conjugate gradient method, 
273-276, 429 
Partial duality, 446 
Partial quasi-Newton method, 296 
Path-following, 126, 127, 129, 130, 131, 
374, 472, 499 
Penalty functions, 243, 275, 402-405, 
407-412, 416-429, 451, 453, 460, 
472, 481-487 
interpretation of, 415 
normalization of, 420 


Percentage test, 230 
Pivoting, 33, 35-36 
Pivot transformations, 54 
Point-to-set mappings, 202-205 
Polak-Ribiere method, 278, 305 
Polyhedron, 63, 519 
Polynomial time, 7, 112, 114, 134, 
139, 212 
Polytopes, 517 
Portfolio analysis, 332 
Positive definite matrix, 511 
Potential function, 119-120, 
131-132, 139-140, 472, 
490-491, 497 
Power generating example, 188-189 
Preconditioning, 306 
Predictor-corrector method, 130, 140 
Primal central path, 122, 124, 126-127 
Primal-dual 
algorithm for LP, 93-95 
central path, 125-126, 130, 
472, 488 
methods, 93-94, 96, 160, 469-499 
optimality theorem, 94 
path, 125-127, 129 
potential function, 131-132, 497 
Primal function, 347, 350-351, 354, 
415-416, 427-428, 436-437, 
440, 454-455 
Primal method, 359-396, 447 
advantage of, 359 
Projection matrix, 368 
Purification procedure, 134 


Quadratic 
approximation, 277 
fit method, 225 
minimization problem, 264, 271 
penalty function, 484 
penalty method, 417-418 
program, 351, 439, 470, 472, 475, 
478, 480, 489, 492 
Quasi-Newton methods, 285-306, 312 
memoryless, 304-305


Rank, 509 

Rank-one correction, 288-290
Rank-reduction procedure, 492 
Rank-two correction, 290 




Rate of convergence, 209-211, 378, 485 
Real number 
arithmetic model, 114 
sets of, 507 
Recursive quadratic programming, 472, 479,
481, 483, 485, 499 
Reduced cost coefficients, 45 
Reduced gradient method, 382-396 
convergence rate of the, 387 
Redundant equations, 98-99
Relative cost coefficients, 45, 88 
Requirements space, 41, 86-87 
Revised simplex method, 56-60, 62, 
64-65, 67, 88, 145, 153, 156, 165 
Robust set, 405 


Scaling, 243-247, 279, 298-304, 
306, 447 
Search by golden section, 217 
Second-order conditions, 190-192, 
333-335, 344-345, 444 
Self-concordant function, 251-252 
Self-dual linear program, 106, 136 
Semidefinite programming (SDP),
491-498 
Sensitivity, 88-89, 339-341, 345, 415 
Sensor localization, 493 
Separable problem, 447-451 
Separating hyperplane theorem, 82, 199, 
348, 521 
Sets, 507, 515 
Sherman-Morrison formula, 294, 457 
Simple merit function, 472 
Simplex method, 33-70 
for dual, 90-93 
and dual problem, 93-98 
and LU decomposition, 59-62 
matrix form of, 54 
for minimum cost flow, 165 
revised, 56-59 
for transportation problems, 153-159 
Simplex multipliers, 64, 88-90, 
153-154, 157-158, 165-166, 174 
Simplex tableau, 45-47 
Slack variables, 12 
Slack vector, 120, 137 
Slater condition, 350-351 
Spacer step, 255-257, 279, 296 




Steepest descent, 233-242, 276, 306-312,
367-394, 446-447
applications, 242-246

Stopping criterion, 230-233, 240-241 

See also Termination 
Strong duality theorem, 439, 497 
Superlinear convergence, 209-210 
Support vector machines, 17 
Supporting hyperplane, 521 
Surplus variables, 12-13 
Synthetic carrot, 45 


Tableau, 45-47 
Tangent plane, 323-326 
Taylor’s Theorem, 513 
Termination, 134-135 
Transportation problem, 15-16, 
81-82, 145-159 
dual of, 81-82 
simplex method for, 153-159 
Transshipment problem, 163 
Tree algorithm, 145, 166, 169-170, 
172, 175 
Triangular bases, 151, 154 
Triangularity, 150 


Triangularization procedure, 151, 164 
Triangular matrices, 150, 525 
Turing model of computation, 113 


Unimodal, 216 
Unimodular, 175 
Upper triangular, 525 


Variable metric method, 290 


Warehousing problem, 16 
Weak duality 
lemma, 83, 126 
proposition, 437 
Wolfe Test, 233 
Working set, 364-368, 370, 383, 396 
Working surface, 364-369, 371, 383 


Zero-duality gap, 126-127 
Zero-order 
conditions, 198-200, 346-354
Lagrange theorem, 352, 439 
Zigzagging, 362, 367 
Zoutendijk method, 361 


Early titles in the 
INTERNATIONAL SERIES IN OPERATIONS RESEARCH 
& MANAGEMENT SCIENCE 


Frederick S. Hillier, Series Editor, Stanford University 


Saigal/ A MODERN APPROACH TO LINEAR PROGRAMMING 

Nagurney/ PROJECTED DYNAMICAL SYSTEMS & VARIATIONAL INEQUALITIES WITH 
APPLICATIONS 

Padberg & Rijal/ LOCATION, SCHEDULING, DESIGN AND INTEGER PROGRAMMING 

Vanderbei/ LINEAR PROGRAMMING 

Jaiswal/ MILITARY OPERATIONS RESEARCH 

Gal & Greenberg/ ADVANCES IN SENSITIVITY ANALYSIS & PARAMETRIC PROGRAMMING 

Prabhu/ FOUNDATIONS OF QUEUEING THEORY 

Fang, Rajasekera & Tsao/ ENTROPY OPTIMIZATION & MATHEMATICAL PROGRAMMING 

Yu/ OR IN THE AIRLINE INDUSTRY 

Ho & Tang/ PRODUCT VARIETY MANAGEMENT 

El-Taha & Stidham/ SAMPLE-PATH ANALYSIS OF QUEUEING SYSTEMS 

Miettinen/ NONLINEAR MULTIOBJECTIVE OPTIMIZATION 

Chao & Huntington/ DESIGNING COMPETITIVE ELECTRICITY MARKETS 

Weglarz/ PROJECT SCHEDULING: RECENT TRENDS & RESULTS 

Sahin & Polatoglu/ QUALITY, WARRANTY AND PREVENTIVE MAINTENANCE 

Tavares/ ADVANCED MODELS FOR PROJECT MANAGEMENT

Tayur, Ganeshan & Magazine/ QUANTITATIVE MODELS FOR SUPPLY CHAIN MANAGEMENT 

Weyant, J./ ENERGY AND ENVIRONMENTAL POLICY MODELING

Shanthikumar, J.G. & Sumita, U/ APPLIED PROBABILITY AND STOCHASTIC PROCESSES 

Liu, B. & Esogbue, A.O./ DECISION CRITERIA AND OPTIMAL INVENTORY PROCESSES 

Gal, T., Stewart, T.J., Hanne, T. / MULTICRITERIA DECISION MAKING: Advances in MCDM 
Models, Algorithms, Theory, and Applications 

Fox, B.L. / STRATEGIES FOR QUASI-MONTE CARLO 

Hall, R.W. / HANDBOOK OF TRANSPORTATION SCIENCE 

Grassman, W.K. / COMPUTATIONAL PROBABILITY 

Pomerol, J-C. & Barba-Romero, S./ MULTICRITERION DECISION IN MANAGEMENT 

Axsater, S./ INVENTORY CONTROL 

Wolkowicz, H., Saigal, R., & Vandenberghe, L./ HANDBOOK OF SEMI-DEFINITE 
PROGRAMMING: Theory, Algorithms, and Applications 

Hobbs, B.F. & Meier, P. / ENERGY DECISIONS AND THE ENVIRONMENT: A Guide
to the Use of Multicriteria Methods

Dar-El, E. / HUMAN LEARNING: From Learning Curves to Learning Organizations 

Armstrong, J.S. / PRINCIPLES OF FORECASTING: A Handbook for Researchers and Practitioners 

Balsamo, S., Personé, V., & Onvural, R./ ANALYSIS OF QUEUEING NETWORKS WITH BLOCKING 

Bouyssou, D. et al. / EVALUATION AND DECISION MODELS: A Critical Perspective 

Hanne, T. / INTELLIGENT STRATEGIES FOR META MULTIPLE CRITERIA DECISION MAKING 

Saaty, T. & Vargas, L./ MODELS, METHODS, CONCEPTS and APPLICATIONS OF THE 
ANALYTIC HIERARCHY PROCESS 

Chatterjee, K. & Samuelson, W. / GAME THEORY AND BUSINESS APPLICATIONS 

Hobbs, B. et al. / THE NEXT GENERATION OF ELECTRIC POWER UNIT COMMITMENT MODELS 

Vanderbei, R.J. / LINEAR PROGRAMMING: Foundations and Extensions, 2nd Ed. 

Kimms, A. / MATHEMATICAL PROGRAMMING AND FINANCIAL OBJECTIVES FOR 
SCHEDULING PROJECTS 

Baptiste, P., Le Pape, C. & Nuijten, W. / CONSTRAINT-BASED SCHEDULING 

Feinberg, E. & Shwartz, A. / HANDBOOK OF MARKOV DECISION PROCESSES: Methods 
and Applications 




Ramik, J. & Vlach, M. / GENERALIZED CONCAVITY IN FUZZY OPTIMIZATION AND DECISION 
ANALYSIS 

Song, J. & Yao, D. / SUPPLY CHAIN STRUCTURES: Coordination, Information and Optimization 

Kozan, E. & Ohuchi, A. / OPERATIONS RESEARCH/ MANAGEMENT SCIENCE AT WORK 

Bouyssou et al. / AIDING DECISIONS WITH MULTIPLE CRITERIA: Essays in Honor of Bernard Roy 

Cox, Louis Anthony, Jr. / RISK ANALYSIS: Foundations, Models and Methods 

Dror, M., L’Ecuyer, P. & Szidarovszky, F. / MODELING UNCERTAINTY: An Examination of 
Stochastic Theory, Methods, and Applications 

Dokuchaev, N. / DYNAMIC PORTFOLIO STRATEGIES: Quantitative Methods and Empirical Rules 
for Incomplete Information 

Sarker, R., Mohammadian, M. & Yao, X. / EVOLUTIONARY OPTIMIZATION 

Demeulemeester, R. & Herroelen, W. / PROJECT SCHEDULING: A Research Handbook 

Gazis, D.C. / TRAFFIC THEORY 

Zhu/ QUANTITATIVE MODELS FOR PERFORMANCE EVALUATION AND BENCHMARKING 

Ehrgott & Gandibleux/ MULTIPLE CRITERIA OPTIMIZATION: State of the Art Annotated 
Bibliographical Surveys 

Bienstock/ Potential Function Methods for Approx. Solving Linear Programming Problems 

Matsatsinis & Siskos/ INTELLIGENT SUPPORT SYSTEMS FOR MARKETING DECISIONS 

Alpern & Gal/ THE THEORY OF SEARCH GAMES AND RENDEZVOUS 

Hall/ HANDBOOK OF TRANSPORTATION SCIENCE - 2nd Ed.

Glover & Kochenberger/ HANDBOOK OF METAHEURISTICS 

Graves & Ringuest/ MODELS AND METHODS FOR PROJECT SELECTION:
Concepts from Management Science, Finance and Information Technology

Hassin & Haviv/ TO QUEUE OR NOT TO QUEUE: Equilibrium Behavior in Queueing Systems 

Gershwin et al/ ANALYSIS & MODELING OF MANUFACTURING SYSTEMS 

Maros/ COMPUTATIONAL TECHNIQUES OF THE SIMPLEX METHOD 

Harrison, Lee & Neale/ THE PRACTICE OF SUPPLY CHAIN MANAGEMENT: Where Theory 
and Application Converge 

Shanthikumar, Yao & Zijm/ STOCHASTIC MODELING AND OPTIMIZATION OF 
MANUFACTURING SYSTEMS AND SUPPLY CHAINS 

Nabrzyski, Schopf & Weglarz/ GRID RESOURCE MANAGEMENT: State of the Art and Future Trends 

Thissen & Herder/ CRITICAL INFRASTRUCTURES: State of the Art in Research and Application 

Carlsson, Fedrizzi, & Fullér/ FUZZY LOGIC IN MANAGEMENT 

Soyer, Mazzuchi & Singpurwalla/ MATHEMATICAL RELIABILITY: An Expository Perspective 

Chakravarty & Eliashberg/ MANAGING BUSINESS INTERFACES: Marketing, Engineering, and 
Manufacturing Perspectives 

Talluri & van Ryzin/ THE THEORY AND PRACTICE OF REVENUE MANAGEMENT 

Kavadias & Loch/ PROJECT SELECTION UNDER UNCERTAINTY: Dynamically Allocating
Resources to Maximize Value 

Brandeau, Sainfort & Pierskalla/ OPERATIONS RESEARCH AND HEALTH CARE: A Handbook of
Methods and Applications 

Cooper, Seiford & Zhu/ HANDBOOK OF DATA ENVELOPMENT ANALYSIS: Models and Methods 

Luenberger/ LINEAR AND NONLINEAR PROGRAMMING, 2nd Ed.

Sherbrooke/ OPTIMAL INVENTORY MODELING OF SYSTEMS: Multi-Echelon Techniques, Second 
Edition 

Chu, Leung, Hui & Cheung/ 4th PARTY CYBER LOGISTICS FOR AIR CARGO 

Simchi-Levi, Wu & Shen/ HANDBOOK OF QUANTITATIVE SUPPLY CHAIN ANALYSIS: Modeling 
in the E-Business Era 

Gass & Assad/ AN ANNOTATED TIMELINE OF OPERATIONS RESEARCH: An Informal History 




Greenberg/ TUTORIALS ON EMERGING METHODOLOGIES AND APPLICATIONS 
IN OPERATIONS RESEARCH 

Weber/ UNCERTAINTY IN THE ELECTRIC POWER INDUSTRY: Methods and Models for Decision 
Support 

Figueira, Greco & Ehrgott/ MULTIPLE CRITERIA DECISION ANALYSIS: State of the Art Surveys 

Reveliotis/ REAL-TIME MANAGEMENT OF RESOURCE ALLOCATIONS SYSTEMS: A Discrete 
Event Systems Approach 

Kall & Mayer/ STOCHASTIC LINEAR PROGRAMMING: Models, Theory, and Computation 


*A list of the more recent publications in the series is at the front of the book*


